Page 372 - A Practical Guide from Design Planning to Manufacturing

342 Chapter Eleven

The detection of a post-silicon bug often begins with a validation system
showing the blue screen of death (BSOD). When the Windows® operating
system detects an error from which it cannot recover, it displays a
warning message on a solid blue screen before restarting. Any processor
with a bug triggered by the operating system may fail in a similar way.
The same type of failure is also produced by an application hitting a bug
that causes corruption of the operating system’s instructions or data.
Another common symptom of a bug is a hung system. If functional unit
A of the processor is waiting for a result from functional unit B, and B
is at the same time waiting for A, the processor can stop execution
altogether. This condition is called deadlock. Similarly, unit A may produce
a result that causes unit B to reproduce its own result. If this causes unit
A to also reproduce its result, the processor again becomes stuck. In this
case, the processor appears busy with lots of instructions being executed,
but no real progress is being made. This condition is called livelock.
Correct designs will have mechanisms for avoiding or getting out of
deadlock or livelock conditions, but these fail-safes may not work for the
unexpected circumstances created by design flaws. As a result, silicon
bugs often appear as a hung processor.
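
The circular wait behind a deadlock can be sketched in a few lines of Python. This is only an illustration of the condition described above, not a method taken from the text: the function name find_wait_cycle and the waits_on mapping are hypothetical, and the sketch assumes each blocked unit waits on exactly one other unit.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_on: Dict[str, str]) -> Optional[List[str]]:
    """Return the units that form a circular wait, or None if there is none.

    waits_on maps each blocked unit to the single unit whose result it
    needs (an assumption of this sketch). Units that are not blocked do
    not appear as keys.
    """
    for start in waits_on:
        path: List[str] = []
        seen = set()
        unit = start
        # Follow the chain of dependencies until it ends or repeats.
        while unit in waits_on and unit not in seen:
            seen.add(unit)
            path.append(unit)
            unit = waits_on[unit]
        if unit in seen:
            # The chain closed on itself: everything from the first
            # occurrence of `unit` onward is the cycle.
            return path[path.index(unit):]
    return None


# Unit A waits for B while B waits for A: a deadlock.
deadlocked = find_wait_cycle({"A": "B", "B": "A"})

# Unit A waits for B, but B is not blocked: no deadlock.
healthy = find_wait_cycle({"A": "B"})
```

A livelock would not show up in such a check, because no unit is blocked; each unit keeps redoing work, so it has to be recognized by the absence of forward progress instead.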
A bug may appear as a test that completes successfully but does not
produce the expected results. Detecting these bugs requires that the cor-
rect results for the test have been created using a compatible processor,
hardware emulation, or software simulation. Performance or power
bugs may appear as a test that produces the correct results but uses
much more time or power than expected.
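
The symptoms described so far can be summarized as a simple triage of one test run. The sketch below is a hedged illustration only: the RunRecord type, the classify function, and the margin of 1.5 are assumptions made for the example, and the expected values are taken to come from a compatible processor, hardware emulation, or software simulation as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    """One execution of a validation test (hypothetical record layout)."""
    completed: bool      # False if the system crashed or hung
    results: List[int]   # values the test produced
    seconds: float       # measured run time
    watts: float         # measured average power


def classify(run: RunRecord,
             expected_results: List[int],
             expected_seconds: float,
             expected_watts: float,
             margin: float = 1.5) -> str:
    """Name the symptom a run shows; margin is an arbitrary example threshold."""
    if not run.completed:
        return "crash or hang"
    if run.results != expected_results:
        return "wrong result"
    if run.seconds > margin * expected_seconds:
        return "possible performance bug"
    if run.watts > margin * expected_watts:
        return "possible power bug"
    return "pass"


# A run that finishes with correct results but takes twice the expected time.
slow = classify(RunRecord(True, [1, 2], 2.0, 10.0), [1, 2], 1.0, 10.0)
```

Any label other than "pass" marks only a possible silicon bug; the cause may still lie in software or in hardware outside the processor.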
Any of these symptoms must be treated as a sign of a possible silicon
bug. Of course, a Windows crash or hang does not necessarily mean
that the processor design is faulty. Software bugs in the
operating system or even the test itself often cause these types of symptoms.
The processor may be correctly executing a test that, because of a software
bug, does not actually do what was intended. The test may violate the
architecture of the processor in some way. For example, a flawed random
test generator might produce an invalid instruction. The processor attempting
to execute this test may produce a different result than the previous-generation
processor, but the architecture may not guarantee any particular
behavior when executing nonsense code.
In addition to software problems, hardware design flaws outside the
processor may also cause these symptoms. A chipset or motherboard flaw never
triggered by the previous-generation processor might appear as a processor
bug. The most difficult bugs to analyze are those caused by interactions
between multiple software and hardware problems. If the processor will be
used with multiple operating systems or hardware configurations, then
each of these possible systems must be tested. Bugs caused by circuit
marginalities will often appear sporadically. To search for these bugs, the
