2. Errors are transient. A state variable may have an incorrect value caused by the
execution of faulty code. However, before this value is accessed and causes a system
failure, some other system input may be processed that resets the state to a valid value.
3. The system may include fault detection and protection mechanisms. These
ensure that the erroneous behavior is discovered and corrected before the
system services are affected.
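As a minimal sketch of such a detection and protection mechanism, the following Python fragment checks a state variable against its valid range before the value is used. The class name, the range limits, and the fallback value are all hypothetical, chosen only for illustration; a real system would derive them from its requirements.

```python
class TankController:
    """Hypothetical controller whose state is checked before it is used."""

    MIN_LEVEL = 0.0
    MAX_LEVEL = 100.0
    SAFE_LEVEL = 50.0  # assumed fallback value, for illustration only

    def __init__(self):
        self.level = self.SAFE_LEVEL
        self.detected_errors = 0

    def faulty_update(self, reading):
        # Deliberate fault: the sign of the reading is flipped, so the
        # state variable can take an erroneous, out-of-range value.
        self.level = -reading

    def checked_level(self):
        # Protection mechanism: detect the erroneous state and correct it
        # before it reaches the service that depends on it.
        if not (self.MIN_LEVEL <= self.level <= self.MAX_LEVEL):
            self.detected_errors += 1
            self.level = self.SAFE_LEVEL
        return self.level


controller = TankController()
controller.faulty_update(30.0)               # fault executes: state is now -30.0
service_value = controller.checked_level()   # error is detected and corrected
```

Here the fault is executed and produces an erroneous state, but the check restores a valid value before the state is read, so no failure is visible to the user of the service.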
Another reason why the faults in a system may not lead to system failures is that,
in practice, users adapt their behavior to avoid using inputs that they know cause
program failures. Experienced users ‘work around’ software features that they have
found to be unreliable. For example, I avoid certain features, such as automatic
numbering, in the word processing system that I used to write this book. When I used
auto-numbering, it often went wrong. Repairing the faults in unused features makes
no practical difference to the system reliability. As users share information on
problems and work-arounds, the effects of software problems are reduced.
The distinction between faults, errors, and failures, explained in Figure 11.3,
helps identify three complementary approaches that are used to improve the
reliability of a system:
1. Fault avoidance: Development techniques are used that minimize the
possibility of human errors and/or trap mistakes before they result in the
introduction of system faults. Examples of such techniques include avoiding
error-prone programming language constructs such as pointers and the use of
static analysis to detect program anomalies.
2. Fault detection and removal: Verification and validation techniques are used
that increase the chances that faults will be detected and removed before the
system is used. Systematic testing and debugging is an example of a
fault-detection technique.
3. Fault tolerance: These are techniques that ensure that faults in a system do not
result in system errors or that system errors do not result in system failures. The
incorporation of self-checking facilities in a system and the use of redundant
system modules are examples of fault tolerance techniques.
The practical application of these techniques is discussed in Chapter 13, which
covers techniques for dependable software engineering.
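To make the idea of redundant system modules concrete, here is a minimal Python sketch of majority voting over three versions of the same computation. The function names and the deliberately seeded fault are hypothetical and are not taken from any particular system; this illustrates the general idea only, not a complete fault-tolerant architecture.

```python
from collections import Counter


def square_a(x):
    return x * x


def square_b(x):
    return x ** 2


def square_faulty(x):
    # Deliberate fault for illustration: doubles instead of squaring.
    return x * 2


def vote(modules, x):
    """Run every redundant module and return the majority result."""
    results = [module(x) for module in modules]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        # No strict majority: the error cannot be masked, so report it.
        raise RuntimeError("no majority among redundant modules")
    return value


modules = [square_a, square_b, square_faulty]
masked = vote(modules, 3)      # faulty module returns 6 and is outvoted
unaffected = vote(modules, 2)  # the fault produces no error for this input
```

For the input 3, the faulty module produces an erroneous result, but the voter masks it, so the error does not become a failure. For the input 2, the faulty code happens to return the correct value, which illustrates that a fault does not lead to an error on every input.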
11.3 Safety
Safety-critical systems are systems where it is essential that system operation is
always safe; that is, the system should never damage people or the system’s environ-
ment even if the system fails. Examples of safety-critical systems include control