Page 186 - The Art of Designing Embedded Systems

Troubleshooting  173

                         Another example: suppose your system runs fine at 10 MHz but
                    never at 20. Obviously you’d put a 20-MHz clock source in and pursue the
                    problem. Every once in a while, go back to 10 MHz just to be sure the
                    symptom has not changed. You could spend a lot of time developing a
                    hypothesis about 20 versus 10 operation, when the 10-MHz test results
                    might actually be a fluke.
                         Assume nothing. Test everything. The PCB may have manufacturing
                    errors on internal layers. Power and ground may not be on the pins you
                    expect, particularly on newer high-density SMT parts. Signals labeled
                    without an inversion bar may actually be active low. You might have ROMs
                    mixed up. Perhaps someone loaded the wrong parts on the board.
                         Never blindly trust your test equipment; know how each instrument
                    works and what its limitations are. If two signals seem impossibly skewed
                    by 15 nsec on the logic analyzer, make sure this is not an artifact of setting
                    it to sample too slowly. When your 100-MHz scope shows a perfectly
                    clean logic level, remember that undetected but virulent strains of 1-nsec
                    glitches can still be running merrily around your circuit.
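The instrument limits mentioned above can be estimated with a little arithmetic. The sketch below is only an illustration under stated assumptions: the function names are invented for this example, the logic analyzer is assumed to sample asynchronously (so each edge can be displayed up to one sample period late), and the scope front end is modeled as a simple single-pole low-pass filter, which real scopes and probes only approximate.

```python
import math


def sample_period_ns(sample_rate_mhz: float) -> float:
    """Sample period of a timing analyzer, in nanoseconds."""
    return 1000.0 / sample_rate_mhz


def skew_could_be_artifact(skew_ns: float, sample_rate_mhz: float) -> bool:
    """True if an observed skew is no larger than one sample period.

    Assumption: each edge is shown at the first sample after it occurs,
    so two simultaneous edges can appear up to one period apart.
    """
    return skew_ns <= sample_period_ns(sample_rate_mhz)


def scope_rise_time_ns(bandwidth_mhz: float) -> float:
    """Approximate 10-90% rise time from the 0.35 / bandwidth rule of thumb."""
    return 350.0 / bandwidth_mhz


def displayed_glitch_fraction(glitch_width_ns: float, bandwidth_mhz: float) -> float:
    """Fraction of a rectangular glitch's amplitude that a single-pole
    low-pass response (assumed model) reaches before the glitch ends."""
    tau_ns = 1000.0 / (2.0 * math.pi * bandwidth_mhz)
    return 1.0 - math.exp(-glitch_width_ns / tau_ns)


if __name__ == "__main__":
    # A 15 nsec skew seen while sampling at 50 MHz (20 nsec period) proves little.
    print(skew_could_be_artifact(15.0, 50.0))
    # The same skew seen at 200 MHz (5 nsec period) deserves attention.
    print(skew_could_be_artifact(15.0, 200.0))
    # A 100-MHz scope has roughly a 3.5 nsec rise time...
    print(scope_rise_time_ns(100.0))
    # ...so a 1 nsec glitch is displayed well below its true amplitude.
    print(displayed_glitch_fraction(1.0, 100.0))
```

Under these assumptions a 1-nsec glitch appears at less than half its real height on a 100-MHz scope, which is easy to miss or dismiss as noise. The analyzer's own channel-to-channel skew is ignored here and would add to the uncertainty.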
                         When you do see a glitch, one that seems impossible given the
                    circuit design, remember that manufacturing shorts can do strange things
                    to signals. Is the part hot? A simple finger test can be a good indicator
                    of a short.


                            On its final spectacular descent to Mars in 1997, the Mars
                       Pathfinder spacecraft experienced a series of watchdog time-outs.
                       The robustly designed code recovered quickly, averting disaster.
                            Engineers later diagnosed and fixed the code, uploading
                       patches across 40 million miles of hostile vacuum. Interestingly
                       enough, they found that exactly the same WDT time-outs had been
                       noted during prelaunch testing, here on Earth. The testers had
                       attributed the rare resets to “glitches” and ignored the problem.
                            Now, some “glitches” have physical manifestations. In one
                       system the timer chip went into an insane mode, where it would for
                       no apparent reason stop outputting pulses. The problem was a reset,
                       which I knew because only a reset, or magic (never to be
                       discounted), could cause the problem.
                            The culprit was a glitch on the reset line, created by the fast
                       logic of the emulator’s pod driving the unmatched impedance of the
                       customer’s two-layer PC board. A simple resistor termination cured
                       the problem.
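
The impedance mismatch behind this kind of glitch can be put into numbers. The sketch below is a generic illustration, not a reconstruction of the actual fix: the text does not say which kind of termination was used, and the function names and all impedance values are assumptions chosen for the example. It uses the standard reflection coefficient, (Z_load - Z0) / (Z_load + Z0), and the usual series-termination rule that the resistor plus the driver's output impedance should roughly equal the trace impedance.

```python
def reflection_coefficient(z_load_ohms: float, z0_ohms: float) -> float:
    """Fraction of an incident edge reflected at the end of a line
    with characteristic impedance z0_ohms terminated in z_load_ohms."""
    return (z_load_ohms - z0_ohms) / (z_load_ohms + z0_ohms)


def series_termination_ohms(z0_ohms: float, driver_output_ohms: float) -> float:
    """Series resistor that makes driver plus resistor match the trace.

    Returns 0 if the driver's output impedance already meets or exceeds
    the trace impedance.
    """
    return max(z0_ohms - driver_output_ohms, 0.0)


if __name__ == "__main__":
    trace_z0 = 100.0        # assumed impedance of a two-layer board trace
    cmos_input = 1.0e6      # assumed high-impedance logic input
    fast_driver = 20.0      # assumed output impedance of a fast driver

    # Nearly the whole edge bounces back from an unterminated input.
    print(reflection_coefficient(cmos_input, trace_z0))
    # A matched load reflects nothing.
    print(reflection_coefficient(trace_z0, trace_z0))
    # Candidate series resistor for these assumed values.
    print(series_termination_ohms(trace_z0, fast_driver))
```

Two-layer boards rarely have well-controlled trace impedance, so in practice a value like this is a starting point to be confirmed on a scope, not a final answer.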