342 Chapter 13 Dependability engineering
The use of software engineering techniques, better programming languages, and better
quality management has led to significant improvements in dependability for most
software. Nevertheless, system failures may still occur that affect the system’s
availability or lead to incorrect results being produced. In some cases, these failures
cause only minor inconvenience, and system vendors may decide to live with them
without correcting the errors in their systems. However, in some systems, failure
can lead to loss of life or significant economic or reputational losses. Such systems are
known as ‘critical systems’, and for them a high level of dependability is essential.
Examples of critical systems include process control systems, protection systems
that shut down other systems in the event of failure, medical systems,
telecommunications switches, and flight control systems. Special development tools and
techniques may be used to enhance the dependability of the software in a critical system.
These tools and techniques usually increase the costs of system development, but they
reduce the risk of system failure and the losses that may result from such a failure.
Dependability engineering is concerned with the techniques that are used to
enhance the dependability of both critical and non-critical systems. These techniques
support three complementary approaches that are used in developing dependable
software:
1. Fault avoidance: The software design and implementation process should use
approaches to software development that help avoid design and programming
errors and so minimize the number of faults that are likely to arise when the
system is executing. Fewer faults mean less chance of run-time failures.
2. Fault detection and correction: The verification and validation processes are
designed to discover and remove faults in a program, before it is deployed for
operational use. Critical systems require very extensive verification and
validation to discover as many faults as possible before deployment and to convince
the system stakeholders that the system is dependable. I cover this topic in
Chapter 15.
3. Fault tolerance: The system is designed so that faults or unexpected system
behavior during execution are detected at run-time and are managed in such a
way that system failure does not occur. Simple approaches to fault tolerance
based on built-in run-time checking may be included in all systems. However,
more specialized fault-tolerance techniques (such as the use of fault-tolerant
system architectures) are generally only used when a very high level of system
availability and reliability is required.
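The simple form of fault tolerance described in the third approach, built-in run-time checking, can be illustrated with a minimal sketch. Everything named here (compute_output, checked_output, SAFE_OUTPUT, MAX_OUTPUT) is hypothetical and chosen only for illustration; the sketch assumes a computation whose result must lie in a known range and for which a safe fallback value exists.

```python
from dataclasses import dataclass

# Hypothetical limits, assumed for illustration only.
SAFE_OUTPUT = 0.0   # value assumed to be safe when a fault is detected
MAX_OUTPUT = 10.0   # assumed upper bound on a plausible result


@dataclass
class Result:
    value: float
    fault_detected: bool


def compute_output(sensor_reading: float) -> float:
    """Hypothetical computation that fails or misbehaves for some inputs."""
    return 100.0 / sensor_reading


def checked_output(sensor_reading: float) -> Result:
    """Run the computation, check the result at run-time, and fall back
    to the safe value instead of letting the fault become a failure."""
    try:
        value = compute_output(sensor_reading)
    except ArithmeticError:
        # Covers division by zero and overflow in the computation.
        return Result(SAFE_OUTPUT, True)
    # Range check; a NaN result fails this comparison and is also rejected.
    if not (0.0 <= value <= MAX_OUTPUT):
        return Result(SAFE_OUTPUT, True)
    return Result(value, False)
```

A caller would use checked_output(20.0) to obtain a normal result, while a reading of 0.0 or 1.0 is reported as a detected fault and replaced by the safe value. Checks of this kind are cheap enough to include in any system; fault-tolerant architectures go considerably further than this.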
Unfortunately, applying fault-avoidance, fault-detection, and fault-tolerance techniques
is subject to diminishing returns. The cost of finding and removing the remaining faults
in a software system rises exponentially as program faults are discovered and removed
(Figure 13.1). As the software becomes more reliable, you need to spend more and more
time and effort to find fewer and fewer faults. At some stage, even for critical systems,
the costs of this additional effort become unjustifiable.
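To make the diminishing returns concrete, the following sketch assumes a purely illustrative cost model in which each successive fault costs a fixed multiple of the previous one to find and remove. The text states only that cost rises exponentially; the function names, the base cost of 1.0, and the growth factor of 1.5 are assumptions, not measured data.

```python
from __future__ import annotations


def cost_per_fault(n_faults: int, base_cost: float = 1.0,
                   growth: float = 1.5) -> list[float]:
    """Assumed cost of removing the 1st, 2nd, ... n-th fault.
    Each fault costs `growth` times as much as the one before it."""
    return [base_cost * growth ** k for k in range(n_faults)]


def faults_removed_within(budget: float, n_faults: int,
                          base_cost: float = 1.0,
                          growth: float = 1.5) -> int:
    """Count how many faults can be removed, in order, before the
    cumulative cost would exceed the budget."""
    spent = 0.0
    removed = 0
    for cost in cost_per_fault(n_faults, base_cost, growth):
        if spent + cost > budget:
            break
        spent += cost
        removed += 1
    return removed
```

Under these assumed parameters, a budget of 10 units removes four faults, and doubling the budget to 20 units removes only one more. This is the pattern the paragraph above describes: each further gain in reliability costs more than the last.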