Page 186 - The Art of Designing Embedded Systems
Another example: suppose your system runs fine at 10 MHz but
never at 20. Obviously you’d put a 20-MHz clock source in and pursue the
problem. Every once in a while, go back to 10 MHz just to be sure the
symptom has not changed. You could spend a lot of time developing a
hypothesis about 20 versus 10 operation, when the 10-MHz test results
might actually be a fluke.
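One way to judge whether a run of clean results at 10 MHz is more than a fluke is to ask how many consecutive passes are needed before an intermittent fault of a given rate would almost certainly have shown itself. The sketch below is only an illustration: it assumes each test run is independent and fails with a fixed probability, and the function name and the numbers are hypothetical.

```python
import math

def passes_needed(failure_rate, confidence=0.95):
    """Consecutive passing runs needed before a fault that strikes with
    probability `failure_rate` per run would have appeared at least once
    with probability `confidence`. Assumes independent runs."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # P(no failure in n runs) = (1 - failure_rate)**n must drop to
    # (1 - confidence) or below; solve for n and round up.
    n = math.log(1.0 - confidence) / math.log(1.0 - failure_rate)
    return math.ceil(n)

if __name__ == "__main__":
    for rate in (0.5, 0.1):
        print(rate, passes_needed(rate))
```

Under these assumptions, a fault that appears in one run out of ten needs 29 clean runs in a row before a pass can be trusted at the 95% level, which is why a handful of good results at 10 MHz proves little.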
Assume nothing. Test everything. The PCB may have manufacturing
errors on internal layers. Power and ground may not be on the pins you
expect, particularly on newer high-density SMT parts. Signals labeled
without an inversion bar may actually be active low. You might have ROMs
mixed up. Perhaps someone loaded the wrong parts on the board.
Never blindly trust your test equipment. Know how each instrument
works and what its limitations are. If two signals seem impossibly skewed
by 15 nsec on the logic analyzer, make sure this is not an artifact of setting
it to sample too slowly. When your 100-MHz scope shows a perfectly
clean logic level, remember that undetected but virulent strains of 1-nsec
glitches can still be running merrily around your circuit.
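Both instrument limits can be put in rough numbers. The sketch below is a back-of-the-envelope aid, not a model of any particular instrument: it assumes a logic analyzer that locates each edge only to the nearest sample, and a scope front end that behaves like a single-pole low-pass filter (the basis of the common 0.35/bandwidth rise-time rule of thumb). All function names are hypothetical.

```python
import math

def skew_bounds_ns(measured_skew_ns, sample_period_ns):
    """Range of true skews consistent with a logic analyzer reading.
    Each edge is only located to within one sample period, so the
    displayed skew can be off by up to that much in either direction."""
    low = max(0.0, measured_skew_ns - sample_period_ns)
    high = measured_skew_ns + sample_period_ns
    return low, high

def scope_rise_time_ns(bandwidth_mhz):
    """Approximate 10-90% rise time from the 0.35 / bandwidth rule
    of thumb (single-pole response assumed)."""
    return 0.35 / (bandwidth_mhz * 1e6) * 1e9

def glitch_peak_fraction(glitch_width_ns, bandwidth_mhz):
    """Fraction of a rectangular glitch's amplitude that a single-pole
    low-pass front end passes at the peak of its response."""
    tau_ns = 1e9 / (2.0 * math.pi * bandwidth_mhz * 1e6)
    return 1.0 - math.exp(-glitch_width_ns / tau_ns)

if __name__ == "__main__":
    print(skew_bounds_ns(15, 15))        # analyzer sampling every 15 nsec
    print(scope_rise_time_ns(100))       # 100-MHz scope
    print(glitch_peak_fraction(1, 100))  # 1-nsec glitch on that scope
```

With a 15-nsec sample period, a displayed skew of 15 nsec is consistent with a true skew anywhere from nearly zero to nearly 30 nsec. For the 100-MHz scope, the rule of thumb gives a rise time of about 3.5 nsec, and under the single-pole assumption a 1-nsec glitch peaks at a bit under half its real amplitude, which is easy to miss when it rarely repeats.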
When you do see a glitch, one that seems impossible given the
circuit design, remember that manufacturing shorts can do strange things
to signals. Is the part hot? A simple finger test may be a good indicator
of a short.
On its final spectacular descent to Mars in 1997, the Mars
Pathfinder spacecraft experienced a series of watchdog time-outs.
The robustly designed code recovered quickly, averting disaster.
Engineers later diagnosed and fixed the code, uploading
patches across 40 million miles of hostile vacuum. Interestingly
enough, they found that exactly the same WDT time-outs had been
noted during prelaunch testing, here on Earth. The testers had
attributed the rare resets to “glitches” and ignored the problem.
Now, some “glitches” have physical manifestations. In one
system the timer chip went into an insane mode, where it would for
no apparent reason stop outputting pulses. The problem was a reset,
which I knew because only a reset, or magic (never to be
discounted), could cause the problem.
The culprit was a glitch on the reset line, created by the fast
logic of the emulator’s pod driving the unmatched impedance of the
customer’s two-layer PC board. A simple resistor termination cured
the problem.
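The size of such a reflection can be estimated from the impedance mismatch. The sketch below uses the standard transmission-line reflection coefficient; the trace impedance, load impedance, and driver output impedance are made-up illustrative values, and series termination is shown only as one common form of resistor termination, since the text does not say which kind was used.

```python
def reflection_coefficient(z_load_ohms, z0_ohms):
    """Fraction of the incident voltage wave reflected where a line of
    characteristic impedance z0 meets a load impedance z_load."""
    return (z_load_ohms - z0_ohms) / (z_load_ohms + z0_ohms)

def series_termination_ohms(z0_ohms, driver_output_ohms):
    """Series resistor that raises the driver's source impedance to the
    line impedance. Returns 0 if the driver is already at or above z0."""
    return max(0.0, z0_ohms - driver_output_ohms)

if __name__ == "__main__":
    z0 = 100.0        # assumed trace impedance on a two-layer board
    z_input = 10e3    # assumed high-impedance logic input
    z_driver = 20.0   # assumed output impedance of a fast driver
    print(reflection_coefficient(z_input, z0))
    print(series_termination_ohms(z0, z_driver))
```

With these assumed values nearly the whole edge reflects from the unterminated input, and a series resistor of about 80 ohms at the driver would absorb the returning wave.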

