Page 138 - The Art of Designing Embedded Systems

Hardware Musings 125

serial receive routine still accepts characters and echoes them to the sender.
After all, the ISR by definition runs independently of the rest of the code,
so will often continue to function when other routines die. If your WDT
tickler stays alive as the world collapses around the rest of the code, then
the watchdog serves no useful purpose.
This problem multiplies in a system with an RTOS, since a reliable
watchdog must monitor all of the tasks. If some of the tasks die but others stay
alive, perhaps tickling the WDT, then the system’s operation is at best
degraded.
In this case write the WDT code as its own task, driven by a timer.
All other tasks send messages to the watchdog process, indicating “I’m
alive.” Only when the WDT activity sees that all tasks that should have
checked in are indeed operating does it service the watchdog. If you use
RTOS-supplied messaging to communicate the tasks’ health, rather than
the dreaded though easy global variables, there’s little chance that errant
code overwriting RAM can create a false indication that all’s OK.
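
The check-in scheme described above might be sketched as follows. This is only an illustration under stated assumptions: the small ring buffer stands in for whatever queue or mailbox the RTOS supplies, and every name here (checkin_send, wdt_task_period, wdt_hw_kick, NUM_TASKS) is hypothetical rather than part of any particular RTOS or watchdog API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TASKS 3u   /* hypothetical number of monitored tasks */
#define QUEUE_LEN 8u   /* capacity of the stand-in message queue */

/* Stand-in for an RTOS message queue. A real system would use the
   RTOS-supplied queue or mailbox calls instead of this ring buffer. */
static uint8_t  queue[QUEUE_LEN];
static unsigned q_head, q_tail, q_count;

static unsigned wdt_kicks;  /* counts simulated hardware services */

/* Each monitored task calls this to report "I'm alive".
   Returns false if the queue is full. */
bool checkin_send(uint8_t task_id)
{
    if (q_count == QUEUE_LEN) {
        return false;
    }
    queue[q_tail] = task_id;
    q_tail = (q_tail + 1u) % QUEUE_LEN;
    q_count++;
    return true;
}

/* Fetch one pending check-in; returns false when the queue is empty. */
static bool checkin_receive(uint8_t *task_id)
{
    assert(task_id != NULL);
    if (q_count == 0u) {
        return false;
    }
    *task_id = queue[q_head];
    q_head = (q_head + 1u) % QUEUE_LEN;
    q_count--;
    return true;
}

/* Placeholder for the hardware-specific watchdog service sequence. */
static void wdt_hw_kick(void)
{
    wdt_kicks++;
}

/* Body of the watchdog task, run once per timer period. It drains the
   check-in messages and services the watchdog only if every task
   reported during this period. Returns true if it serviced the WDT. */
bool wdt_task_period(void)
{
    const uint32_t all_tasks = (UINT32_C(1) << NUM_TASKS) - 1u;
    uint32_t seen = 0u;
    uint8_t id;

    while (checkin_receive(&id)) {
        if (id < NUM_TASKS) {
            seen |= (UINT32_C(1) << id);
        }
    }
    if (seen == all_tasks) {
        wdt_hw_kick();
        return true;
    }
    return false;
}

/* Test hook: how many times the simulated watchdog was serviced. */
unsigned wdt_kick_count(void)
{
    return wdt_kicks;
}
```

Because the record of who checked in is rebuilt from messages every period and lives only in a local variable, a stray write to RAM is less likely to fake an all-clear than it would be with a set of global "alive" flags. The hardware time-out would need to be longer than the watchdog task's period for this to work.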
Suppose the WDT does indeed find a fault and resets the CPU. Then
what? A simple reset and restart may not be safe or wise.
One system uses very high-energy gamma rays to measure the thick-
ness of steel. A hardware problem led to a series of watchdog time-outs. I
watched, aghast, as this system cycled through WDT resets about once a
second, each time opening the safety shield around the gamma ray source!
The technicians were understandably afraid to approach close enough to
yank the power cord.
If you cannot guarantee that the system will be safe after the watch-
dog fires, then you simply must add hardware to put it in a reasonable, non-
dangerous, mode.
Even units that have no safety issues suffer from poorly thought-out
WDT designs. A sensor company complained that their products were get-
ting slower. Over time, and with several thousand units in the field, re-
sponse time to user inputs degraded noticeably. A bit of research showed
that their system’s watchdog properly drove the CPU’s reset signal, and
the code then recognized a warm boot, going directly to the application
with no indication to the users that the time-out had occurred. We tracked
the problem down to a floating input on the CPU that caused the software
to crash, up to several thousand times per second. The processor
was spending most of its time resetting, leading to apparently slow user
response.
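
A silent warm boot like the one in this story could be made visible by counting watchdog resets at startup. The following is a minimal sketch, not the sensor company's code. It assumes the record can be placed in memory that survives a reset (a no-init RAM section or nonvolatile storage, which is toolchain- and hardware-specific) and that the startup code can tell from the processor's reset-cause flags whether the watchdog fired. The names boot_record_update, fault_indicator_should_be_on, and BOOT_MAGIC are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BOOT_MAGIC UINT32_C(0x57445431)  /* arbitrary marker value */

typedef struct {
    uint32_t magic;       /* BOOT_MAGIC once the record is initialized */
    uint32_t wdt_resets;  /* watchdog resets since the last cold boot */
    uint32_t check;       /* bitwise complement of wdt_resets */
} boot_record_t;

/* Assumption: on a real target this variable is placed where it keeps
   its contents across a reset. As an ordinary static it is zeroed at
   startup, which this sketch treats as "no valid record". */
static boot_record_t boot_record;

/* Call early in startup. reset_was_watchdog is assumed to come from
   the processor's reset-cause flags. Returns the number of watchdog
   resets recorded since the record was last initialized. */
uint32_t boot_record_update(bool reset_was_watchdog)
{
    bool valid = (boot_record.magic == BOOT_MAGIC) &&
                 (boot_record.check == (uint32_t)~boot_record.wdt_resets);

    if (!valid) {
        /* Cold boot or corrupted record: start counting from zero. */
        boot_record.magic = BOOT_MAGIC;
        boot_record.wdt_resets = 0u;
    }
    if (reset_was_watchdog && boot_record.wdt_resets < UINT32_MAX) {
        boot_record.wdt_resets++;
    }
    boot_record.check = (uint32_t)~boot_record.wdt_resets;

    assert(boot_record.check == (uint32_t)~boot_record.wdt_resets);
    return boot_record.wdt_resets;
}

/* Decide whether to light a fault LED or raise a status flag. */
bool fault_indicator_should_be_on(uint32_t wdt_resets)
{
    return wdt_resets > 0u;
}
```

With a count like this available, a unit resetting thousands of times per second would show a rapidly climbing number instead of merely feeling slow to its users.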
If your system recovers automatically from a WDT time-out, add an
LED or status display so users (or at least the programmers!) know that