
11.2   Availability and reliability  297


Term                    Description

Human error or mistake
    Human behavior that results in the introduction of faults into a system. For example, in the wilderness weather system, a programmer might decide that the way to compute the time for the next transmission is to add 1 hour to the current time. This works except when the transmission time is between 23.00 and midnight (midnight is 00.00 in the 24-hour clock).

System fault
    A characteristic of a software system that can lead to a system error. The fault is the inclusion of the code to add 1 hour to the time of the last transmission, without checking whether the time is greater than or equal to 23.00.

System error
    An erroneous system state that can lead to system behavior that is unexpected by system users. The value of transmission time is set incorrectly (to 24.XX rather than 00.XX) when the faulty code is executed.

System failure
    An event that occurs at some point in time when the system does not deliver a service as expected by its users. No weather data is transmitted because the time is invalid.




Figure 11.3 Reliability terminology

System reliability and availability problems are mostly caused by system failures. Some of these failures are a consequence of specification errors or failures in other related systems such as a communications system. However, many failures are a consequence of erroneous system behavior that derives from faults in the system. When discussing reliability, it is helpful to use precise terminology and distinguish between the terms ‘fault,’ ‘error,’ and ‘failure.’ I have defined these terms in Figure 11.3 and have illustrated each definition with an example from the wilderness weather system.

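The chain from fault to error to failure in Figure 11.3 can be traced in a few lines of code. The following is a minimal sketch, not code from the wilderness weather system: the function names are hypothetical, and it assumes that the transmitter simply rejects an invalid time, which is one way the "no weather data is transmitted" failure could arise.

```python
from collections.abc import Callable


def next_time_faulty(hour: int, minute: int) -> tuple[int, int]:
    # System fault: adds 1 hour with no check for hour >= 23,
    # so 23.30 becomes 24.30 rather than 00.30.
    return hour + 1, minute


def next_time_fixed(hour: int, minute: int) -> tuple[int, int]:
    # Corrected version: wraps past midnight on the 24-hour clock.
    return (hour + 1) % 24, minute


def is_valid_time(hour: int, minute: int) -> bool:
    return 0 <= hour <= 23 and 0 <= minute <= 59


def schedule_transmission(
    hour: int, minute: int, next_time: Callable[[int, int], tuple[int, int]]
) -> str:
    # Executing the faulty code with hour == 23 creates an erroneous
    # state: a next-transmission time of 24.XX.
    h, m = next_time(hour, minute)
    if not is_valid_time(h, m):
        # System failure (assumed behavior, hypothetical): the invalid
        # time is rejected, so no weather data is transmitted.
        return "FAILURE: no transmission"
    return f"next transmission at {h:02d}.{m:02d}"


for t in [(22, 15), (23, 30)]:
    print(t, schedule_transmission(*t, next_time_faulty))
print((23, 30), schedule_transmission(23, 30, next_time_fixed))
```

With the faulty function, a current time of 22.15 schedules the next transmission at 23.15 as expected; the fault stays latent. Only an input between 23.00 and 23.59 executes it in a way that creates the erroneous state and, in turn, the failure. The fixed version schedules 00.30.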
When an input or a sequence of inputs causes faulty code in a system to be executed, an erroneous state is created that may lead to a software failure. Figure 11.4, derived from Littlewood (1990), shows a software system as a mapping of a set of inputs to a set of outputs. Given an input or input sequence, the program responds by producing a corresponding output. For example, given an input of a URL, a web browser produces an output that is the display of the requested web page.

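Viewing a program as a mapping from inputs to outputs suggests a direct way to find the erroneous input set for a small program: compare every output with the output the specification requires. The sketch below is hypothetical; it reuses the transmission-time example, treats a correctly wrapping function as the specification (an assumed oracle), and is only feasible because the input domain of 1,440 clock times is tiny enough to enumerate exhaustively.

```python
from itertools import product

# (hour, minute) on the 24-hour clock
Time = tuple[int, int]


def specified_next_time(t: Time) -> Time:
    # Assumed oracle: the output the specification requires (wraps at midnight).
    hour, minute = t
    return (hour + 1) % 24, minute


def program_next_time(t: Time) -> Time:
    # The program as implemented, containing the fault (no wrap-around).
    hour, minute = t
    return hour + 1, minute


# The input set: every valid (hour, minute) pair, 24 * 60 = 1440 inputs.
inputs = list(product(range(24), range(60)))

# The erroneous input set: inputs whose actual output differs from the specified one.
erroneous_inputs = {t for t in inputs if program_next_time(t) != specified_next_time(t)}

print(len(inputs), len(erroneous_inputs))
print(sorted({hour for hour, _ in erroneous_inputs}))
```

Only 60 of the 1,440 inputs, all with hour 23, map to an erroneous output. The remaining inputs are handled correctly, which is why such a fault can go unnoticed.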
Most inputs do not lead to system failure. However, some inputs or input combinations, shown in the shaded ellipse Iₑ in Figure 11.4, cause system failures or erroneous outputs to be generated. The program’s reliability depends on how many of the inputs it receives in use are members of Iₑ, the set of inputs that lead to an erroneous output. If inputs in the set Iₑ are processed by frequently used parts of the system, then failures will be frequent. However, if the inputs in Iₑ are processed by code that is rarely used, then users will hardly ever see failures.

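Because reliability depends on how often inputs from Iₑ actually occur, the same fault can produce very different failure rates under different patterns of use. The sketch below illustrates this with three hypothetical usage profiles (assumed weights, not measured data) for the transmission-time fault, where Iₑ is every time from 23.00 to 23.59. It simplifies reliability to the weighted share of inputs that fall in Iₑ.

```python
from fractions import Fraction
from itertools import product


def in_erroneous_set(t: tuple[int, int]) -> bool:
    # I_e for the faulty transmission-time code: any time from 23.00 to 23.59.
    hour, _ = t
    return hour == 23


def failure_probability(profile: dict[tuple[int, int], int]) -> Fraction:
    # Weighted share of inputs that fall in I_e under a given usage profile.
    total = sum(profile.values())
    failing = sum(w for t, w in profile.items() if in_erroneous_set(t))
    return Fraction(failing, total)


def uniform_profile(hours: range) -> dict[tuple[int, int], int]:
    # Hypothetical profile: every minute in the given hours is equally likely.
    return {(h, m): 1 for h, m in product(hours, range(60))}


profiles = {
    "round the clock": uniform_profile(range(24)),
    "late evening only": uniform_profile(range(21, 24)),
    "daytime only": uniform_profile(range(8, 18)),
}

for name, profile in profiles.items():
    print(name, failure_probability(profile))
```

Under these assumed profiles, round-the-clock use fails on 1/24 of inputs, late-evening use on 1/3, and daytime-only use never triggers the fault at all, even though the code is identical in every case.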
Because each user of a system uses it in different ways, users have different perceptions of its reliability. Faults that affect the reliability of the system for one user may never be revealed under someone else’s mode of working (Figure 11.5). In Figure 11.5, the set of erroneous inputs corresponds to the ellipse labeled Iₑ in Figure 11.4. The set of inputs produced by User 2 intersects with this erroneous input set. User 2 will