
12.3   Reliability specification  325


2.  It provides a basis for assessing when to stop testing a system. You stop when the system has achieved its required reliability level.
3.  It is a means of assessing different design strategies intended to improve the reliability of a system. You can make a judgment about how each strategy might lead to the required levels of reliability.
4.  If a regulator has to approve a system before it goes into service (e.g., all systems that are critical to flight safety on an aircraft are regulated), then evidence that a required reliability target has been met is important for system certification.


  To establish the required level of system reliability, you have to consider the associated losses that could result from a system failure. These are not simply financial losses but also the loss of a business's reputation, and loss of reputation means that customers will go elsewhere. Although the short-term losses from a system failure may be relatively small, the longer-term losses may be much more significant. For example, if you try to access an e-commerce site and find that it is unavailable, you may try to find what you want elsewhere rather than wait for the system to become available. If this happens more than once, you will probably not shop at that site again.
  The problem with specifying reliability using metrics such as POFOD, ROCOF, and AVAIL is that it is possible to overspecify reliability and thus incur high development and validation costs. This happens because system stakeholders find it difficult to translate their practical experience into quantitative specifications. They may think that a POFOD of 0.001 (1 failure in 1,000 demands) represents a relatively unreliable system. However, as I have explained, if demands for a service are uncommon, it actually represents a very high level of reliability.
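The following sketch shows why a POFOD of 0.001 can mean high reliability when demands are rare. It converts a POFOD and a demand rate into an expected interval between failures. It assumes that each demand fails independently with probability equal to the POFOD. The function name `years_between_failures` and the three demand rates are hypothetical and chosen only for illustration.

```python
def years_between_failures(pofod: float, demands_per_year: float) -> float:
    """Expected years between failures for an on-demand service.

    Assumption (illustrative, not part of the POFOD definition): each demand
    fails independently with probability `pofod`, so the expected number of
    failures per year is pofod * demands_per_year.
    """
    if not 0.0 < pofod <= 1.0:
        raise ValueError("pofod must be in (0, 1]")
    if demands_per_year <= 0:
        raise ValueError("demands_per_year must be positive")
    return 1.0 / (pofod * demands_per_year)


if __name__ == "__main__":
    # The same POFOD of 0.001 under three hypothetical demand rates.
    for label, rate in [("monthly", 12), ("daily", 365), ("hourly", 24 * 365)]:
        years = years_between_failures(0.001, rate)
        print(f"{label:>7} demands: about {years:,.1f} years "
              f"({years * 365:,.0f} days) between failures")
```

Under these assumptions, monthly demands give roughly 83 years between failures on average. Hourly demands bring the interval down to about six weeks, even though the POFOD is the same in both cases.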
  If you specify reliability as a metric, it is obviously important to confirm that the required level of reliability has been achieved. You do this assessment as part of system testing. To assess the reliability of a system statistically, you have to observe a number of failures. If you have, for example, a POFOD of 0.0001 (1 failure in 10,000 demands), then you may have to design tests that make 50,000 or 60,000 demands on the system so that several failures are observed. It may be practically impossible to design and implement this number of tests. Therefore, overspecification of reliability leads to very high testing costs.
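The arithmetic behind this testing estimate can be sketched as follows. The sketch assumes a simple binomial model in which each test demand fails independently with probability equal to the POFOD. The function names `expected_failures` and `prob_at_least`, and the threshold of three observed failures, are hypothetical. With a POFOD of 0.0001, 10,000 demands yield only one expected failure. Raising that to 50,000 or 60,000 demands yields five or six, which makes observing several failures far more likely.

```python
import math


def expected_failures(demands: int, pofod: float) -> float:
    """Mean number of failures in `demands` independent test demands."""
    return demands * pofod


def prob_at_least(k: int, demands: int, pofod: float) -> float:
    """Binomial probability of observing at least `k` failures.

    Computed as 1 - P(fewer than k failures). Assumes independent demands
    that each fail with the same probability `pofod`.
    """
    fewer = sum(
        math.comb(demands, i) * pofod**i * (1.0 - pofod) ** (demands - i)
        for i in range(k)
    )
    return 1.0 - fewer


if __name__ == "__main__":
    pofod = 0.0001  # 1 failure in 10,000 demands
    for n in (10_000, 50_000, 60_000):
        print(
            f"{n:>6,} demands: expect {expected_failures(n, pofod):.1f} failures, "
            f"P(at least 3 observed) = {prob_at_least(3, n, pofod):.3f}"
        )
```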
  When you specify the availability of a system, you may have similar problems. Although a very high level of availability may seem desirable, most systems have very intermittent demand patterns (e.g., a business system will mostly be used during normal business hours), so a single availability figure does not really reflect user needs. You need high availability when the system is being used but not at other times. Depending, of course, on the type of system, there may be no real practical difference between an availability of 0.999 and an availability of 0.9999.
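One way to judge whether the difference between 0.999 and 0.9999 matters is to convert each figure into expected downtime. This sketch assumes that availability is the fraction of time the system is up and that downtime is spread evenly across the year. The 2,080-hour business year (8 hours a day, 5 days a week, 52 weeks) is a hypothetical usage pattern, and `downtime_hours` is an illustrative name.

```python
HOURS_PER_YEAR = 365 * 24              # 8,760 hours; leap years ignored
BUSINESS_HOURS_PER_YEAR = 8 * 5 * 52   # hypothetical 2,080-hour working year


def downtime_hours(availability: float, period_hours: float) -> float:
    """Expected hours of unavailability over `period_hours`.

    Assumes downtime is spread uniformly over time, so the share that
    falls in any window is proportional to the window's length.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    for avail in (0.999, 0.9999):
        total = downtime_hours(avail, HOURS_PER_YEAR)
        in_use = downtime_hours(avail, BUSINESS_HOURS_PER_YEAR)
        print(f"AVAIL {avail}: {total:.2f} h/year in total, "
              f"{in_use:.2f} h during business hours")
```

On these assumptions, 0.999 means about 8.8 hours of downtime a year, of which only about 2 hours fall within business hours. A figure of 0.9999 cuts the yearly total to under an hour. Whether that difference justifies the extra cost depends on the system.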
  A fundamental problem with overspecification is that it may be practically impossible to show that a very high level of reliability or availability has been achieved. For example, say a system was intended for use in a safety-critical application and was therefore required to never fail over its total lifetime. Assume that 1,000 copies of the system are to be installed and the system is executed 1,000