Page 373 - A Practical Guide from Design Planning to Manufacturing

Silicon Debug and Test  343

        frequency, voltage, and temperature of the processor must be swept
        through the full range allowed. Extremes of the manufacturing process
        are tested by creating “skew” wafer lots where chips are intentionally
        fabricated at the edges of the normal range of variation. 7
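
        As a minimal sketch of such a sweep, the snippet below enumerates every
        combination of frequency, voltage, and temperature to be tested. The
        numeric limits and the helper names (sweep_points, linspace) are
        assumptions made for illustration; real limits come from the product
        specification, and skew lots would add a further dimension.

```python
from itertools import product


def linspace(lo, hi, steps):
    """Evenly spaced values from lo to hi inclusive (steps must be >= 2)."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def sweep_points(freqs_mhz, volts, temps_c):
    """Return every (frequency, voltage, temperature) combination to test."""
    return list(product(freqs_mhz, volts, temps_c))


# Hypothetical operating limits, chosen only for illustration.
points = sweep_points(
    freqs_mhz=linspace(1000, 2000, 3),
    volts=linspace(0.9, 1.3, 3),
    temps_c=linspace(0, 100, 3),
)

for freq, volt, temp in points:
    # A real flow would program the tester here and record pass/fail.
    pass
```

        Three steps per axis give 27 test conditions; the count grows as the
        product of the steps on each axis, which is one reason sweep granularity
        has to be chosen with care.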
          For any bug, the first step after detection is to confirm that the
        symptom is caused by a real bug by trying to reproduce the failure. Early
        in post-silicon validation, silicon test may not be able to fully screen out
        manufacturing defects. This means a simple broken wire, rather than a
        design flaw, may be causing a failure symptom. Reproducing the same
        symptom on other parts rules this out. If the bug can be reproduced
        in the RTL model, analysis is vastly simpler. This requires picking
        out a segment of the failing test small enough to be simulated in
        RTL in a reasonable time. Once the bug is reproduced in simulation, the
        values of all the RTL nodes are available, giving far more information than
        can be measured from the silicon. Unfortunately, many bugs cannot be
        reproduced in simulation because they would require too long a test sequence
        or because they are the result of circuit problems. In any case, once a bug
        is confirmed on multiple parts, the next task is finding a work-around.
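
        The triage described above can be sketched as a small decision function.
        This is only an illustration: the function name, the result strings, and
        the rule that a failure seen on a single part is treated as a suspected
        defect are all assumptions, not a procedure given in the text.

```python
def triage(failing_parts, parts_tested, reproduced_in_rtl):
    """Classify a failure symptom from reproduction results.

    failing_parts:     set of part IDs that show the symptom
    parts_tested:      set of part IDs on which the test was run
    reproduced_in_rtl: True if a short segment of the test fails in RTL simulation
    """
    if not failing_parts:
        return "not reproduced"
    if len(parts_tested) < 2:
        # One part cannot separate a design flaw from a broken wire.
        return "unconfirmed: test more parts"
    if len(failing_parts) == 1:
        # Assumed rule: a symptom unique to one part points to a defect.
        return "suspect manufacturing defect"
    if reproduced_in_rtl:
        return "confirmed bug: analyze in RTL simulation"
    return "confirmed bug: analyze on silicon"


result = triage({"A", "B"}, {"A", "B", "C"}, reproduced_in_rtl=False)
```

        In either confirmed case the next step is the same: look for a
        work-around so that validation can continue.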
          It is important that the root cause of each bug is ultimately
        determined, but in the short term it is more important that the search
        for more bugs continues. It’s extremely unlikely that any particular bug
        will be the last one found, and a bug that prevents tests from running
        successfully may be hiding many more bugs. This is especially true for
        bugs that prevent the processor from completing reset or booting
        operating systems. These bugs may prevent any post-silicon validation at
        all from being performed until a work-around is identified.
          A common work-around is turning off some of the processor
        functionality. If branch prediction or the top-level cache is faulty and is
        preventing the processor from completing the reset sequence, the faulty
        feature might be disabled temporarily. The processor will execute
        exceedingly slowly, but validation work can continue. Temporary updates to
        the BIOS or operating system may also allow a bug to be avoided. Failing
        these, further testing may be limited to only certain platforms or certain
        voltage and temperature conditions. In the worst case, some tests may have
        to be avoided until a work-around is found or the bug is fixed.
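
        One way to picture the search for a feature-disable work-around is as a
        loop over candidate configurations, trying the smallest sets first. The
        sketch below is hypothetical throughout: find_workaround, the feature
        names (including "prefetch", which the text does not mention), and the
        fake_run stand-in for a run on real hardware are all assumptions.

```python
from itertools import combinations


def find_workaround(features, passes_with_disabled, max_disabled=2):
    """Return the smallest set of disabled features that lets the test pass.

    passes_with_disabled is a callable taking a set of disabled feature
    names and returning True if the failing test now passes. Returns None
    if no combination of up to max_disabled features helps.
    """
    for count in range(1, max_disabled + 1):
        for disabled in combinations(features, count):
            if passes_with_disabled(set(disabled)):
                return set(disabled)
    return None


def fake_run(disabled):
    """Stand-in for a hardware run: passes only with branch prediction off."""
    return "branch_prediction" in disabled


# Hypothetical feature list, for illustration only.
features = ["branch_prediction", "top_level_cache", "prefetch"]
workaround = find_workaround(features, fake_run)
```

        Trying single features before pairs keeps the work-around as small as
        possible, which limits how much performance and test coverage is lost
        while the root cause is still unknown.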
          Finding bugs is in many ways easier after first silicon. Many more test
        cycles can be run than were possible in simulation. However, finding the root
        cause of bugs is far more difficult. Pre-silicon tools typically look for one type
        of problem at a time. Separate simulations are used to check for logic bugs,
        speedpaths, and circuit marginalities. In the real world, these may all
        interact to produce sometimes baffling behavior. Locating the specific design



          7 Josephson, “Design Methodology for the McKinley Processor.”