Page 112 - The Art of Designing Embedded Systems

Firmware Musings 99

Some embedded systems are pretty tolerant of memory problems. We
hear of NASA spacecraft from time to time whose core or RAM develops
a few bad bits, yet somehow the engineers patch their code to operate
around the faulty areas, uploading the corrections over the distances of bil-
lions of miles.
Most of us work on systems with far less human intervention. There
are no teams of highly trained personnel anxiously monitoring the health
of each part of our products. It’s our responsibility to build a system that
works properly when the hardware is functional.
In some applications, though, a certain amount of self-diagnosis ei-
ther makes sense or is required; critical life-support applications should use
every diagnostic concept possible to avoid disaster due to a submicron
RAM imperfection.
So, the first rule about diagnostics in general, and RAM tests in par-
ticular, is to clearly define your goals. Why run the test? What will the re-
sult be? Who will be the unlucky recipient of the bad news in the event an
error is found, and what do you expect that person to do?
Will a RAM problem kill someone? If so, a very comprehensive test,
run regularly, is mandatory.
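
The "run regularly" half of that requirement might look something like the sketch below. It is only an illustration: the names (`critical_enter`, `critical_exit`, `ram_word_ok`, `ram_slice_check`) are hypothetical, the interrupt hooks are empty stubs so the code runs on a host, and the two-pattern check catches only stuck data bits. It is not a comprehensive test on its own, since it will not find address-line or coupling faults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: on real hardware these would mask and restore
   interrupts so no ISR touches a cell while its contents are displaced.
   They are empty stubs here so the sketch runs on a host. */
static void critical_enter(void) {}
static void critical_exit(void) {}

/* Test one word in place: save it, write two complementary patterns,
   read each back, then restore the saved value.
   Returns true if both patterns read back correctly. */
static bool ram_word_ok(volatile uint32_t *cell)
{
    critical_enter();
    uint32_t saved = *cell;

    *cell = 0xAAAAAAAAu;
    bool ok = (*cell == 0xAAAAAAAAu);

    *cell = 0x55555555u;
    ok = ok && (*cell == 0x55555555u);

    *cell = saved;              /* put the application's data back */
    critical_exit();
    return ok;
}

/* Check `words` words starting at `base`. Intended to be called on a
   small slice of RAM per timer tick so the whole array is covered over
   time. Returns the index of the first failing word, or -1 if none. */
static long ram_slice_check(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++) {
        if (!ram_word_ok(&base[i])) {
            return (long)i;
        }
    }
    return -1;
}
```

Testing a slice at a time keeps each pass short enough to hide in a background task. What the system does when `ram_slice_check` returns a failing index is exactly the question posed above: decide that before writing the test.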
Is such a failure merely a nuisance? For instance, if it keeps a cell
phone from booting, and there’s nothing the customer can do about the
failure anyway, then perhaps there’s no reason for doing a test. As a
consumer I couldn’t care less why the damn phone stopped working . . . if
it’s dead, I’ll take it in for repair or replacement.
Is production test, or even engineering test, the real motivation for
writing diagnostic code? If so, then define exactly what problems you’re
looking for and write code that will find those sorts of troubles.
Next, inject a dose of reality into your evaluation. Remember that
today’s hardware is often very highly integrated. In the case of a
microcontroller with on-board RAM, the chance of a memory failure that
doesn’t also kill the CPU is small. Again, if the system is a critical life-support
application it may indeed make sense to run a test, as even a minuscule
probability of a fault may spell disaster.
Does it make sense to ignore RAM failures? If your CPU has an il-
legal instruction trap, there’s a pretty good chance that memory prob-
lems will cause a code crash you can capture and process. If the chip
includes protection mechanisms (like the x86 protected mode), count on
bad stack reads immediately causing protection faults your handlers can
process. Perhaps RAM tests are simply not required, given these extra
resources.
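
One way to "capture and process" such a crash is to have the trap handler leave a small record behind for the next boot to find. The following is a minimal sketch under stated assumptions, not any particular CPU's mechanism: every name in it (`fault_record`, `fault_capture`, `fault_pending`, `FAULT_MAGIC`) is hypothetical, and on real hardware `g_fault` would have to live in RAM that the startup code does not clear. The record carries a magic number and a check word because it sits in the very memory under suspicion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum fault_cause {
    FAULT_NONE = 0,
    FAULT_ILLEGAL_INSTRUCTION = 1,
    FAULT_PROTECTION = 2
};

struct fault_record {
    uint32_t magic;    /* FAULT_MAGIC when the record is populated   */
    uint32_t cause;    /* one of enum fault_cause                    */
    uint32_t address;  /* faulting address, as reported by the trap  */
    uint32_t check;    /* magic ^ cause ^ address, to spot corruption */
};

#define FAULT_MAGIC 0xC0DEFA17u

/* Assumption: placed in a RAM region that survives reset. Here it is an
   ordinary zero-initialized static so the sketch runs on a host. */
static struct fault_record g_fault;

/* Called from the trap handler. A real handler would then force a reset
   or drive the system to a safe state. */
static void fault_capture(enum fault_cause cause, uint32_t address)
{
    g_fault.cause   = (uint32_t)cause;
    g_fault.address = address;
    g_fault.check   = FAULT_MAGIC ^ g_fault.cause ^ address;
    g_fault.magic   = FAULT_MAGIC;   /* written last: marks record valid */
}

/* Called early at boot. Copies out a valid record, then clears it so the
   same crash is not reported twice. Returns true if a record was found. */
static bool fault_pending(struct fault_record *out)
{
    assert(out != NULL);
    bool valid = g_fault.magic == FAULT_MAGIC &&
                 g_fault.check ==
                     (FAULT_MAGIC ^ g_fault.cause ^ g_fault.address);
    if (valid) {
        *out = g_fault;
    }
    g_fault.magic = 0;
    return valid;
}
```

At boot, a call to `fault_pending` tells the firmware whether the previous run ended in a trap, and the goals defined earlier determine what happens next: log it, flag it for service, or refuse to start.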
