Page 112 - The Art of Designing Embedded Systems

Firmware Musings  99


                         Some embedded systems are pretty tolerant of memory problems. We
                    hear of NASA spacecraft from time to time whose core or RAM develops
                    a few bad bits, yet somehow the engineers patch their code to operate
                    around the faulty areas, uploading the corrections over distances of
                    billions of miles.
                         Most of us work on systems with far less human intervention. There
                    are no teams of highly trained personnel anxiously monitoring the health
                    of each part of our products. It’s our responsibility to build a system that
                    works properly when the hardware is functional.
                         In some applications, though, a certain amount of self-diagnosis
                    either makes sense or is required; critical life-support applications
                    should use every diagnostic concept possible to avoid disaster due to
                    a submicron RAM imperfection.
                         So, the first rule about diagnostics in general, and RAM tests in
                    particular, is to clearly define your goals. Why run the test? What will
                    the result be? Who will be the unlucky recipient of the bad news in the
                    event an error is found, and what do you expect that person to do?
                         Will a RAM problem kill someone? If so, a very comprehensive test,
                    run regularly, is mandatory.
                         Is such a failure merely a nuisance? For instance, if it keeps a cell
                    phone from booting, and there’s nothing the customer can do about the
                    failure anyway, then perhaps there’s no reason to run a test at all. As a
                    consumer I couldn’t care less why the damn phone stopped working . . . if
                    it’s dead, I’ll take it in for repair or replacement.
                         Is production test (or even engineering test) the real motivation for
                    writing diagnostic code? If so, then define exactly what problems you’re
                    looking for and write code that will find those sorts of troubles.
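                         As one illustration of matching the test to the fault, suppose the
                    production-test goal is only to catch a stuck or shorted data line. A
                    minimal sketch in C, under that assumption, walks a single 1 bit across
                    one RAM location and reads each pattern back. The function name
                    walking_ones_data_bus and its return convention are hypothetical, and the
                    routine deliberately says nothing about address-line or cell failures;
                    those would need different code.

```c
#include <stdint.h>

/* Hypothetical production-test routine (not from the original text).
 * Assumed goal: find a stuck or shorted DATA line, so one address is enough.
 * It does not detect address-line faults or bad cells elsewhere in RAM.
 * Returns 0 if every pattern read back correctly, otherwise the first
 * bit pattern that failed. */
static uint8_t walking_ones_data_bus(volatile uint8_t *addr)
{
    /* 0x01, 0x02, ... 0x80; shifting 0x80 left wraps the uint8_t to 0,
     * which ends the loop after eight patterns. */
    for (uint8_t pattern = 1; pattern != 0; pattern <<= 1) {
        *addr = pattern;          /* drive exactly one data line high */
        if (*addr != pattern) {   /* volatile forces a real read-back */
            return pattern;       /* report the failing bit */
        }
    }
    return 0;                     /* no data-line fault observed */
}
```

                    On real hardware addr would point at an actual RAM location rather than
                    a variable. Bus capacitance can sometimes hold a floating line at the value
                    just written, so a more careful version might touch a different address
                    between each write and its read-back.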
                         Next, inject a dose of reality into your evaluation. Remember that
                    today’s hardware is often very highly integrated. In the case of a
                    microcontroller with on-board RAM, the chance of a memory failure that
                    doesn’t also kill the CPU is small. Again, if the system is a critical
                    life-support application, it may indeed make sense to run a test, as even
                    a minuscule probability of a fault may spell disaster.
                         Does it make sense to ignore RAM failures? If your CPU has an
                    illegal-instruction trap, there’s a pretty good chance that memory
                    problems will cause a code crash you can capture and process. If the chip
                    includes protection mechanisms (like x86 protected mode), count on bad
                    stack reads immediately causing protection faults your handlers can
                    process. Perhaps RAM tests are simply not required, given these extra
                    resources.