Page 373 - A Practical Guide from Design Planning to Manufacturing
Silicon Debug and Test 343
frequency, voltage, and temperature of the processor must be swept
through the full range allowed. Extremes of the manufacturing process
are tested by creating “skew” wafer lots where chips are intentionally fab-
ricated at the edges of the normal range of variation. 7
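The sweep described above can be pictured as a grid of operating conditions. The following is a minimal sketch, not a description of any real tester: the function names, the numeric ranges, and the choice of three steps per axis are all assumptions made for illustration.

```python
from itertools import product


def sweep_points(lo, hi, steps):
    """Return `steps` evenly spaced values from lo to hi, both extremes included."""
    if steps < 2:
        raise ValueError("need at least two steps to cover both extremes")
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def operating_conditions(freq_mhz, volt_mv, temp_c, steps=3):
    """Build every (frequency, voltage, temperature) combination to test.

    Each argument is a hypothetical (min, max) pair for the allowed range.
    """
    return list(product(
        sweep_points(*freq_mhz, steps),
        sweep_points(*volt_mv, steps),
        sweep_points(*temp_c, steps),
    ))


# Illustrative ranges only; real limits come from the product specification.
conditions = operating_conditions((1000, 3000), (900, 1300), (0, 100))
```

With three steps per axis this gives 27 conditions, running from the low extreme on every axis to the high extreme on every axis. Skew lots add a further dimension that cannot be swept in software, since it is fixed when the part is fabricated.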
For any bug, the first step after detection is to try to reproduce the
failure, confirming that the symptom is caused by a real bug. Early
in post-silicon validation, silicon test may not be able to fully screen out
manufacturing defects. This means a simple broken wire, rather than a
design flaw, may be causing a failure symptom. Reproducing the same
symptom with other parts rules this out. If the bug can be reproduced
in the RTL model, this vastly simplifies analysis. Doing so requires picking
out a segment of the failing test small enough to be simulated in
RTL in a reasonable time. Once the bug is reproduced in simulation, the values of
all the RTL nodes are available, giving far more information than can
be measured from the silicon. Unfortunately, many bugs cannot be repro-
duced in simulation because they would require too long a test sequence
or they are the result of circuit problems. In any case, if a bug is confirmed
on multiple parts, the next task is finding a work-around.
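The confirmation step above amounts to a simple triage rule. The sketch below is one hypothetical way to express it; the function name, the result labels, and the threshold of two failing parts are assumptions, not values taken from the text.

```python
def classify_failure(results, min_parts=2):
    """Classify a failure symptom from per-part reproduction results.

    `results` maps a part identifier to True if the symptom reproduced
    on that part. `min_parts` is an assumed threshold for treating the
    symptom as confirmed on multiple parts.
    """
    failing = [part for part, failed in results.items() if failed]
    if not failing:
        return "not reproduced"
    if len(failing) < min_parts:
        # A single failing part may just have a broken wire.
        return "possible manufacturing defect"
    return "confirmed on multiple parts"


# Hypothetical part identifiers and results.
verdict = classify_failure({"part_a": True, "part_b": True, "part_c": False})
```

A symptom seen on only one part is treated as a possible defect until more parts are tried; a symptom seen on several parts moves on to the search for a work-around.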
It is important that the root cause of each bug is ultimately deter-
mined, but in the short term what is more important is that the search
for more bugs continues. It’s extremely unlikely that any particular bug
will be the last one found, and a bug that prevents tests from running
successfully may be hiding many more bugs. This is especially true for
bugs that prevent the processor from completing reset or booting oper-
ating systems. These bugs may prevent any post-silicon validation at all
from being performed until a work-around is identified.
A common work-around is turning off some of the processor func-
tionality. If branch prediction or the top-level cache is faulty and is
preventing the processor from completing the reset sequence, these fea-
tures might be disabled temporarily. The processor will execute exceed-
ingly slowly, but validation work can continue. Temporary updates to the
BIOS or operating system may allow a bug to be avoided. Failing these,
further testing may be limited to only certain platforms or certain volt-
age and temperature conditions. In the worst case, some tests may have
to be avoided until a work-around is found or the bug is fixed.
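The fallback order in this paragraph can be sketched as a short search over candidate work-arounds. This is an illustration only: the labels are paraphrases of the text, the `works` callback is hypothetical, and trying feature disabling before a BIOS or operating system update is an assumed ordering, since the text presents those two as alternatives.

```python
# Candidate work-arounds, from least to most restrictive (assumed order).
WORKAROUNDS = [
    "disable faulty feature",
    "temporary BIOS or OS update",
    "restrict platforms or voltage/temperature conditions",
]


def choose_workaround(works):
    """Return the first work-around for which `works(name)` is true.

    `works` is a caller-supplied check, for example one that reruns the
    failing test with the work-around applied. If nothing helps, the
    affected tests are skipped until a work-around or fix exists.
    """
    for name in WORKAROUNDS:
        if works(name):
            return name
    return "skip affected tests"


# Hypothetical check: only the BIOS/OS update avoids the bug.
choice = choose_workaround(lambda name: "BIOS" in name)
```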
Finding bugs is in many ways easier after first silicon. Many more test
cycles can be run than were possible in simulation. However, finding the root
cause of bugs is far more difficult. Pre-silicon tools typically look for one type
of problem at a time. Separate simulations are used to check for logic bugs,
speedpaths, and circuit marginalities. In the real world, these may all inter-
act to produce sometimes baffling behavior. Locating the specific design
7 Josephson, “Design Methodology for the McKinley Processor.”

