Page 138 - The Art of Designing Embedded Systems

Hardware Musings  125


                   ial receive routine still accepts characters and echoes them to the sender.
                   After all, the ISR by definition runs independently of the rest of the code,
                   so will often continue to function when other routines die. If your WDT
                   tickler stays alive as the world collapses around the rest of the code, then
                   the watchdog serves no useful purpose.
                        This problem multiplies in a system with an RTOS, since a reliable
                   watchdog must monitor all of the tasks. If some of the tasks die but others
                   stay alive (perhaps tickling the WDT), then the system’s operation is at
                   best degraded.
                        In this case write the WDT code as its own task, driven by a timer.
                   All other tasks send messages to the watchdog process, indicating “I’m
                   alive.” Only when the WDT activity sees that all tasks that should have
                   checked in are indeed operating does it service the watchdog. If you use
                   RTOS-supplied messaging to communicate the tasks’ health, rather than
                   dreaded though easy global variables, there’s little chance that errant
                   code overwriting RAM can create a false indication that all’s OK.
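The check-in scheme described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the task IDs, the function names, and the `wdt_kick()` stand-in are all hypothetical, and the RTOS queue that would deliver the "I'm alive" messages is left out so the logic can run on a host. The supervisor task would call `wdt_supervisor_on_message()` for each message it receives and `wdt_supervisor_on_timer()` once per timer period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task IDs; a real system would list its own tasks here. */
enum { TASK_SENSOR, TASK_COMMS, TASK_UI, NUM_TASKS };

static_assert(NUM_TASKS <= 32, "one bit per task must fit in uint32_t");

#define ALL_TASKS_MASK ((uint32_t)((1ull << NUM_TASKS) - 1ull))

/* Stand-in for the hardware kick (assumption). On a real target this would
 * write the WDT service register; here it only counts calls. */
static unsigned wdt_kick_count;
static void wdt_kick(void) { wdt_kick_count++; }

/* Check-ins seen during the current period. Only the supervisor task
 * touches this; other tasks report through messages, not shared RAM. */
static uint32_t checked_in;

/* Called by the supervisor for each "I'm alive" message pulled from its
 * queue. Out-of-range IDs are ignored rather than trusted. */
void wdt_supervisor_on_message(unsigned task_id)
{
    if (task_id < NUM_TASKS) {
        checked_in |= (uint32_t)1 << task_id;
    }
}

/* Called once per timer period. Services the watchdog only if every task
 * checked in during that period, then starts a fresh period.
 * Returns true if the watchdog was serviced. */
bool wdt_supervisor_on_timer(void)
{
    bool all_alive = (checked_in == ALL_TASKS_MASK);
    if (all_alive) {
        wdt_kick();
    }
    checked_in = 0;
    return all_alive;
}
```

Clearing the record every period means a task that checked in once and then died cannot keep the watchdog fed. The timer period must be longer than the slowest task's normal check-in interval, or a healthy system will reset itself.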
                        Suppose the WDT does indeed find a fault and resets the CPU. Then
                   what? A simple reset and restart may not be safe or wise.
                        One system uses very high-energy gamma rays to measure the thickness
                   of steel. A hardware problem led to a series of watchdog time-outs. I
                   watched, aghast, as this system cycled through WDT resets about once a
                   second, each time opening the safety shield around the gamma ray source!
                   The technicians were understandably afraid to approach close enough to
                   yank the power cord.
                        If you cannot guarantee that the system will be safe after the watchdog
                   fires, then you simply must add hardware to put it in a reasonable,
                   non-dangerous mode.
                        Even units that have no safety issues suffer from poorly thought-out
                   WDT designs. A sensor company complained that their products were getting
                   slower. Over time, and with several thousand units in the field, response
                   time to user inputs degraded noticeably. A bit of research showed
                   that their system’s watchdog properly drove the CPU’s reset signal, and
                   the code then recognized a warm boot, going directly to the application
                   with no indication to the users that the time-out had occurred. We tracked
                   the problem down to a floating input on the CPU that caused the software
                   to crash, up to several thousand times per second. The processor
                   was spending most of its time resetting, leading to apparently slow user
                   response.
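One way to keep such resets from passing unnoticed is to count them in memory that survives a warm boot. The C sketch below is an assumption-laden illustration, not the method used in the system described: `reset_log_t`, `boot_decide()`, the magic value, and the storm threshold are hypothetical names and numbers. It also assumes the startup code can tell whether the watchdog caused the reset, and that the log lives in RAM the startup code does not clear. Both of those are target and toolchain specific.

```c
#include <stdbool.h>
#include <stdint.h>

#define RESET_LOG_MAGIC   0x57445431u /* arbitrary marker (assumption) */
#define RESET_STORM_LIMIT 3u          /* hypothetical threshold */

/* Would be placed in RAM that is not zeroed at startup. */
typedef struct {
    uint32_t magic;      /* valid only if equal to RESET_LOG_MAGIC */
    uint32_t wdt_resets; /* watchdog resets since the last cold boot */
} reset_log_t;

typedef enum {
    BOOT_NORMAL,     /* start the application as usual */
    BOOT_SHOW_FAULT, /* start, but light an LED or post a status message */
    BOOT_HOLD_SAFE   /* repeated resets: stay in a safe state */
} boot_action_t;

/* Decide what to do at startup. reset_was_wdt would come from the
 * processor's reset-cause flags, if it provides them. */
boot_action_t boot_decide(reset_log_t *log, bool reset_was_wdt)
{
    if (log->magic != RESET_LOG_MAGIC) {
        /* Cold boot: RAM holds garbage, so start a fresh log. */
        log->magic = RESET_LOG_MAGIC;
        log->wdt_resets = 0;
    }
    if (!reset_was_wdt) {
        return BOOT_NORMAL;
    }
    if (log->wdt_resets < UINT32_MAX) {
        log->wdt_resets++; /* saturate instead of wrapping to zero */
    }
    return (log->wdt_resets >= RESET_STORM_LIMIT) ? BOOT_HOLD_SAFE
                                                  : BOOT_SHOW_FAULT;
}
```

With a counter like this, a unit resetting thousands of times per second would show up as a large number in the log instead of as vaguely sluggish response.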
                        If your system recovers automatically from a WDT time-out, add an
                   LED or status display so users (or at least the programmers!) know that