Page 372 - A Practical Guide from Design Planning to Manufacturing

342   Chapter Eleven

          The detection of a post-silicon bug often begins with a validation system
        showing the blue screen of death (BSOD). When the Windows® operating
        system detects an error from which it cannot recover, it displays a
        warning message on a solid blue screen before restarting. Any processor
        with a bug triggered by the operating system may fail in a similar way.
        The same type of failure is also produced by an application hitting a bug
        that causes corruption of the operating system’s instructions or data.
          Another common symptom of a bug is a hung system. If functional unit
        A of the processor is waiting for a result from functional unit B, and B
        is at the same time waiting for A, the processor can stop execution
        altogether. This condition is called deadlock. Similarly, unit A may produce
        a result that causes unit B to reproduce its own result. If this causes unit
        A to also reproduce its result, the processor again becomes stuck. In this
        case, the processor appears busy with lots of instructions being executed,
        but no real progress is being made. This condition is called livelock.
        Correct designs will have mechanisms for avoiding or getting out of
        deadlock or livelock conditions, but these fail-safes may not work for the
        unexpected circumstances created by design flaws. As a result, silicon
        bugs often appear as a hung processor.
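
          The two kinds of hang described above can be told apart with a small
        model. The sketch below is illustrative only: the wait-for table, the
        counter names, and both function names are hypothetical and do not
        correspond to any real debug interface. It treats deadlock as a circular
        wait between units, and livelock as activity without completed work.

```python
def find_wait_cycle(waits_for):
    """Return the units that form a circular wait, or None if there is none.

    waits_for is a hypothetical snapshot mapping each unit name to the one
    unit it is currently waiting on (None if it is not waiting).
    """
    for start in waits_for:
        seen = []
        unit = start
        # Follow the chain of waits until it ends or revisits a unit.
        while unit is not None and unit not in seen:
            seen.append(unit)
            unit = waits_for.get(unit)
        if unit is not None:
            # The chain looped back: the cycle starts where 'unit' was first seen.
            return seen[seen.index(unit):]
    return None


def classify_hang(executed_delta, completed_delta):
    """Classify an observation window from two assumed counters.

    executed_delta:  instructions executed during the window
    completed_delta: instructions whose results were actually committed
    """
    if completed_delta > 0:
        return "progressing"
    if executed_delta > 0:
        return "livelock"   # busy, but no real progress
    return "deadlock"       # nothing is moving at all


print(find_wait_cycle({"A": "B", "B": "A"}))
print(classify_hang(executed_delta=5000, completed_delta=0))
```

          A real processor would expose this information, if at all, through
        design-specific debug features, so the model shows only the distinction
        between the two conditions, not a way to measure them.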
          A bug may appear as a test that completes successfully but does not
        produce the expected results. Detecting these bugs requires that the cor-
        rect results for the test have been created using a compatible processor,
        hardware emulation, or software simulation. Performance or power
        bugs may appear as a test that produces the correct results but uses
        much more time or power than expected.
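
          A check of this kind might look like the following sketch. The record
        layout, the limits, and the function name are assumptions made for
        illustration. The expected result is presumed to come from a compatible
        processor, hardware emulation, or software simulation, as described above.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical record of one completed test run."""
    result: bytes    # final output or memory image produced by the test
    seconds: float   # measured run time
    joules: float    # measured energy use


def check_run(run, expected_result, max_seconds, max_joules):
    """Return a list of findings; an empty list means the run looks clean."""
    findings = []
    if run.result != expected_result:
        findings.append("wrong result")
    if run.seconds > max_seconds:
        findings.append("performance")
    if run.joules > max_joules:
        findings.append("power")
    return findings


# The test finished and the result matches, but it took far longer than expected.
slow_run = RunRecord(result=b"\x2a", seconds=9.5, joules=3.0)
print(check_run(slow_run, expected_result=b"\x2a", max_seconds=2.0, max_joules=5.0))
```

          Each finding is only a sign of a possible bug; the limits used here are
        placeholders that would have to come from the design's own performance
        and power targets.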
          Any of these symptoms must be treated as a sign of a possible silicon
        bug. Of course, not every Windows crash or hang means that the processor
        design is faulty. Software bugs in the operating system or even in the test
        itself often cause these types of symptoms. The processor may be correctly
        executing a test that, because of a software bug, does not actually do what
        was intended. The test may violate the
        architecture of the processor in some way. For example, a flawed random
        test generator might produce an invalid instruction. The processor attempt-
        ing to execute this test may produce a different result than the previous
        generation processor, but the architecture may not guarantee any partic-
        ular behavior when executing nonsense code.
          In addition to software problems, hardware design flaws outside the
        processor may also cause these symptoms. A chipset or motherboard flaw never
        triggered by the previous-generation processor might appear as a processor
        bug. The most difficult bugs to analyze are those caused by interactions
        between multiple software and hardware problems. If the processor will be
        used with multiple operating systems or hardware configurations, then
        each of these possible systems must be tested. Bugs caused by circuit
        marginalities will often appear sporadically. To search for these bugs, the