11.2 Availability and reliability 297

Term                    Description

Human error or mistake  Human behavior that results in the introduction of faults into a
                        system. For example, in the wilderness weather system, a
                        programmer might decide that the way to compute the time for the
                        next transmission is to add 1 hour to the current time. This works
                        except when the transmission time is between 23.00 and midnight
                        (midnight is 00.00 in the 24-hour clock).

System fault            A characteristic of a software system that can lead to a system
                        error. The fault is the inclusion of the code to add 1 hour to the
                        time of the last transmission, without a check if the time is
                        greater than or equal to 23.00.

System error            An erroneous system state that can lead to system behavior that is
                        unexpected by system users. The value of transmission time is set
                        incorrectly (to 24.XX rather than 00.XX) when the faulty code is
                        executed.

System failure          An event that occurs at some point in time when the system does
                        not deliver a service as expected by its users. No weather data is
                        transmitted because the time is invalid.

Figure 11.3 Reliability terminology

System reliability and availability problems are mostly caused by system failures.
Some of these failures are a consequence of specification errors or failures in other
related systems such as a communications system. However, many failures are a
consequence of erroneous system behavior that derives from faults in the system.
When discussing reliability, it is helpful to use precise terminology and distinguish
between the terms ‘fault,’ ‘error,’ and ‘failure.’ I have defined these terms in
Figure 11.3 and have illustrated each definition with an example from the wilderness
weather system.
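
The fault, error, and failure chain in the wilderness weather example can be made concrete with a small sketch. The function names below (`next_transmission_faulty`, `next_transmission_fixed`, `is_valid_time`, `transmit`) are hypothetical and are not taken from the actual weather system. The sketch assumes times are held as an (hour, minute) pair on the 24-hour clock.

```python
from typing import Tuple


def next_transmission_faulty(hour: int, minute: int) -> Tuple[int, int]:
    """Contains the fault: adds 1 hour with no check for hour >= 23."""
    return hour + 1, minute


def next_transmission_fixed(hour: int, minute: int) -> Tuple[int, int]:
    """One possible repair (an assumption): wrap the hour around midnight."""
    return (hour + 1) % 24, minute


def is_valid_time(hour: int, minute: int) -> bool:
    """A time is valid on the 24-hour clock if 00.00 <= time <= 23.59."""
    return 0 <= hour <= 23 and 0 <= minute <= 59


def transmit(hour: int, minute: int) -> bool:
    """Return True if data would be sent; False models the system failure."""
    return is_valid_time(hour, minute)


# The fault is always present, but it only produces an error for some inputs.
ok_state = next_transmission_faulty(10, 15)    # valid state, no error
bad_state = next_transmission_faulty(23, 30)   # erroneous state (hour is 24)

print(ok_state, transmit(*ok_state))
print(bad_state, transmit(*bad_state))
print(next_transmission_fixed(23, 30))
```

In this sketch the fault is the missing wrap-around check, the error is the stored hour value of 24, and the failure is the transmission that does not happen. For any time before 23.00 the faulty code behaves correctly, so the fault stays hidden.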
When an input or a sequence of inputs causes faulty code in a system to be
executed, an erroneous state is created that may lead to a software failure. Figure 11.4,
derived from Littlewood (1990), shows a software system as a mapping of a set of
inputs to a set of outputs. Given an input or input sequence, the program responds by
producing a corresponding output. For example, given an input of a URL, a web
browser produces an output that is the display of the requested web page.
Most inputs do not lead to system failure. However, some inputs or input
combinations, shown in the shaded ellipse I_e in Figure 11.4, cause system failures or
erroneous outputs to be generated. The program’s reliability depends on the number of
system inputs that are members of the set of inputs that lead to an erroneous output.
If inputs in the set I_e are executed by frequently used parts of the system, then
failures will be frequent. However, if the inputs in I_e are executed by code that is rarely
used, then users will hardly ever see failures.
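
The effect of usage frequency on observed failures can be illustrated with a rough sketch. It assumes that each input has a relative usage weight and that an input either always fails or never fails. The input labels, the weights, and the function name `failure_probability` are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Set


def failure_probability(usage: Dict[str, float], erroneous: Set[str]) -> float:
    """Chance that one input, drawn according to its usage weight, is in I_e."""
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("usage weights must sum to a positive value")
    hits = sum(weight for inp, weight in usage.items() if inp in erroneous)
    return hits / total


# Hypothetical erroneous input set I_e: only input "d" triggers faulty code.
I_E = {"d"}

# Same program, same fault, two different usage patterns (assumed weights).
frequently_used = {"a": 25, "b": 25, "c": 0, "d": 50}
rarely_used = {"a": 500, "b": 300, "c": 199, "d": 1}

print(failure_probability(frequently_used, I_E))
print(failure_probability(rarely_used, I_E))
```

Under these assumed weights, the same single fault is hit on half of all inputs in the first pattern and on about one input in a thousand in the second, even though the program itself is unchanged.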
Because each user of a system uses it in different ways, they have different
perceptions of its reliability. Faults that affect the reliability of the system for one user may
never be revealed under someone else’s mode of working (Figure 11.5). In Figure 11.5,
the set of erroneous inputs corresponds to the ellipse labeled I_e in Figure 11.4. The set
of inputs produced by User 2 intersects with this erroneous input set. User 2 will
