

                                    2.  Availability The probability that a system, at a point in time, will be operational
                                        and able to deliver the requested services.

                                       One of the practical problems in developing reliable systems is that our intuitive
                                    notions of reliability and availability are sometimes broader than these limited defi-
                                    nitions. The definition of reliability states that the environment in which the system
                                    is used and the purpose that it is used for must be taken into account. If you measure
                                    system reliability in one environment, you can’t assume that the reliability will be
                                    the same if the system is used in a different way.
                                       For example, let’s say that you measure the reliability of a word processor in an
                                    office environment where most users are uninterested in the operation of the soft-
                                    ware. They follow the instructions for its use and do not try to experiment with the
                                    system. If you then measure the reliability of the same system in a university envi-
                                    ronment, then the reliability may be quite different. Here, students may explore the
                                    boundaries of the system and use the system in unexpected ways. This may result in
                                    system failures that did not occur in the more constrained office environment.
                                       These standard definitions of availability and reliability do not take into account
                                    the severity of failure or the consequences of unavailability. People often accept
                                    minor system failures but are very concerned about serious failures that have high
                                    consequential costs. For example, computer failures that corrupt stored data are less
                                    acceptable  than  failures  that  freeze  the  machine  and  that  can  be  resolved  by
                                    restarting the computer.
                                       A strict definition of reliability relates the system implementation to its specifica-
                                    tion. That is, the system is behaving reliably if its behavior is consistent with that
                                    defined in the specification. However, a common cause of perceived unreliability is
                                    that the system specification does not match the expectations of the system users.
                                    Unfortunately, many specifications are incomplete or incorrect and it is left to soft-
                                    ware engineers to interpret how the system should behave. As they are not domain
                                    experts, they may not, therefore, implement the behavior that users expect. It is also
                                    true, of course, that users don’t read system specifications. They may therefore have
                                    unrealistic expectations of the system.
                                       Availability and reliability are obviously linked as system failures may crash the
                                    system. However, availability does not just depend on the number of system crashes,
                                    but  also  on  the  time  needed  to  repair  the  faults  that  have  caused  the  failure.
Therefore, if system A fails once a year and system B fails once a month, then A is
clearly more reliable than B. However, assume that system A takes three days to
                                    restart after a failure, whereas system B takes 10 minutes to restart. The availability
                                    of system B over the year (120 minutes of down time) is much better than that of
                                    system A (4,320 minutes of down time).
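   To make this comparison concrete, the following short Python sketch (not from the
text; the function name and constants are illustrative) computes the annual availability
of each system from its number of failures per year and the repair time per failure:

   # Comparison of annual availability for two hypothetical systems,
   # using the figures given above: system A fails once a year and takes
   # 3 days to restart; system B fails once a month and takes 10 minutes.

   MINUTES_PER_YEAR = 365 * 24 * 60

   def annual_availability(failures_per_year: int, repair_minutes: float) -> float:
       """Fraction of the year during which the system is operational."""
       downtime = failures_per_year * repair_minutes
       return (MINUTES_PER_YEAR - downtime) / MINUTES_PER_YEAR

   # System A: 1 failure/year, 3 days (4,320 minutes) to restart.
   avail_a = annual_availability(1, 3 * 24 * 60)

   # System B: 12 failures/year, 10 minutes to restart.
   avail_b = annual_availability(12, 10)

   print(f"System A: {avail_a:.4%} available")
   print(f"System B: {avail_b:.4%} available")

   Running this sketch shows that system B, despite failing twelve times more often, is
available roughly 99.98% of the time over the year, compared with roughly 99.18% for
system A.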
                                       The disruption caused by unavailable systems is not reflected in the simple avail-
                                    ability metric that specifies the percentage of time that the system is available. The
                                    time when the system fails is also significant. If a system is unavailable for an hour
                                    each day between 3 am and 4 am, this may not affect many users. However, if the
                                    same  system  is  unavailable  for  10  minutes  during  the  working  day,  system
                                    unavailability will probably have a much greater effect.