Page 132 - Statistics for Environmental Engineers










                       15




                       Analyzing Censored Data






                       KEY WORDS censored data, delta-lognormal distribution, median, limit of detection, trimmed mean,
                       Winsorized mean, probability plot, rankit, regression, order statistics.

                       Many important environmental problems focus on chemicals that are expected to exist at very low
                       concentrations, or to be absent. Under these conditions, a set of data may include some observations that are
                       reported as “not detected” or “below the method limit of detection (MDL).” Such a data set is said to be censored.
                        Censored data are, in essence, missing values. Missing values in data records are common and they
                       are not always a serious problem. If 50 specimens were collected and five of them, selected at random,
                       were damaged or lost, we could do the analysis as though there were only 45 observations. If a few
                       values are missing at random intervals from a time series, they can be filled in without seriously distorting
                       the pattern of the series. The difficulty with censored data is that missing values are not selected at
                       random. They are all missing at one end of the distribution. We cannot go ahead as if they never existed
                       because this would bias the final results.
                        The odd feature of censored water quality data is that the censored values were not originally missing.
                       Some numerical value was measured, but the analytical chemist determined that the value was below
                       the method limit of detection (MDL) and reported <MDL instead of the number. A better practice is to
                       report all values along with a statement of their precision and let the data analyst decide what weight
                       the very low values should carry in the final interpretation. Some laboratories do this, but there are
                       historical data records that have been censored and new censored data are still being produced. Methods
                       are needed to interpret these.
                        Unfortunately, there is no generally accepted scheme for replacing the censored observations with
                       some arbitrary values. Replacing censored observations with zero or 0.5 MDL gives estimates of the
                       mean that are biased low and estimates of the variance that are biased high. Replacing the censored
                       values with the MDL, or omitting the censored observations, gives estimates of the mean that are biased
                       high and estimates of the variance that are biased low. The bias of both the mean and the variance
                       increases as the fraction of censored observations increases or as the MDL increases (Berthouex and Brown, 1994).
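
                        The direction of these substitution biases can be checked numerically. The sketch below is
                       illustrative only. The lognormal population, the random seed, the sample size, and the MDL of 0.8
                       are assumptions chosen for the demonstration, not values from the text, and substitution_summary
                       is a hypothetical helper name.

```python
import random
import statistics


def substitution_summary(true_values, mdl):
    """Return {rule: (mean, variance)} for several ways of handling values below mdl."""

    def summarize(xs):
        return statistics.mean(xs), statistics.variance(xs)

    return {
        # What we would get if the laboratory had reported every number.
        "uncensored": summarize(true_values),
        # Substitution rules applied to the values that fall below the MDL.
        "zero": summarize([x if x >= mdl else 0.0 for x in true_values]),
        "half_mdl": summarize([x if x >= mdl else 0.5 * mdl for x in true_values]),
        "mdl": summarize([x if x >= mdl else mdl for x in true_values]),
        # Dropping the censored observations entirely.
        "omitted": summarize([x for x in true_values if x >= mdl]),
    }


random.seed(1)  # arbitrary seed, only so the run is repeatable
# Synthetic "true" concentrations (assumed lognormal, for illustration only).
true_values = [random.lognormvariate(0.0, 0.5) for _ in range(200)]
mdl = 0.8  # hypothetical method limit of detection

for rule, (mean, var) in substitution_summary(true_values, mdl).items():
    print(f"{rule:10s} mean = {mean:.3f}   variance = {var:.3f}")
```

                        Comparing each row with the "uncensored" row shows which way a rule pushes the mean and the
                       variance for this particular synthetic sample. Other populations or MDL values will change the size
                       of the effect.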
                        The median, trimmed mean, and Winsorized mean are three unbiased estimates of the mean for normal
                       or other symmetrical distributions. They are insensitive to information from the extremes of the
                       distribution and can be used when the extent of censoring is moderate (i.e., not more than 15 to 25%).
                       Graphical interpretation with probability plots is useful, especially when the degree of censoring is high.
                       Cohen’s maximum likelihood method for estimating the mean and variance is widely used when
                       censoring is 25% or less.
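
                        As a rough illustration of why the trimmed and Winsorized means tolerate censoring, the sketch
                       below uses the common textbook definitions, which are assumed here rather than taken from this page:
                       the trimmed mean drops the k smallest and k largest observations, and the Winsorized mean replaces
                       them with the nearest retained values. Marking a censored result as None, the parameter k, and the
                       function names are all hypothetical conventions for this example.

```python
from typing import List, Optional, Sequence


def _retained_core(values: Sequence[Optional[float]], k: int) -> List[float]:
    """Sorted values left after removing the k lowest and k highest ranks.

    None marks a censored ('<MDL') result and is assumed to rank below
    every quantified value.
    """
    n = len(values)
    quantified = sorted(v for v in values if v is not None)
    n_censored = n - len(quantified)
    if k < n_censored:
        raise ValueError("k must be at least the number of censored values")
    if 2 * k >= n:
        raise ValueError("k is too large for this sample size")
    # The k lowest ranks hold every censored value plus (k - n_censored)
    # quantified ones; the k highest ranks are all quantified.
    return quantified[k - n_censored: len(quantified) - k]


def trimmed_mean(values: Sequence[Optional[float]], k: int) -> float:
    """Average of the observations that remain after trimming k from each end."""
    core = _retained_core(values, k)
    return sum(core) / len(core)


def winsorized_mean(values: Sequence[Optional[float]], k: int) -> float:
    """Average after replacing the k lowest and k highest values with the
    nearest retained values."""
    core = _retained_core(values, k)
    total = sum(core) + k * core[0] + k * core[-1]
    return total / len(values)


# Hypothetical sample: two of ten results were reported as <MDL.
sample = [None, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(trimmed_mean(sample, k=2), winsorized_mean(sample, k=2))
```

                        Neither function ever needs the numerical values of the censored observations, which is the
                       property that makes these estimators usable when censoring is moderate.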



                       The Median

                       The median is an unbiased estimate of the mean of any symmetric distribution (e.g., the normal
                       distribution). The median is unaffected by the magnitude of observations on the tails of the distribution. It is
                       also unaffected by censoring so long as more than half of the observations have been quantified.
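
                        A minimal sketch of this property follows, assuming that censored results are marked as None and
                       that every censored value ranks below every quantified value. The function name censored_median and
                       the sample values are hypothetical. The median is taken as the middle ranked value, or the average
                       of the two middle values when the count is even.

```python
from typing import Optional, Sequence


def censored_median(values: Sequence[Optional[float]]) -> Optional[float]:
    """Median of a sample in which None marks a '<MDL' result.

    Returns None when a middle rank falls among the censored values,
    because the median cannot then be determined from the reported numbers.
    """
    n = len(values)
    if n == 0:
        raise ValueError("empty sample")
    quantified = sorted(v for v in values if v is not None)
    n_censored = n - len(quantified)

    # Zero-based ranks of the middle value (odd n) or two middle values (even n).
    lower_rank, upper_rank = (n - 1) // 2, n // 2
    if lower_rank < n_censored:
        return None

    # Shift ranks to index into the quantified values only.
    lower = quantified[lower_rank - n_censored]
    upper = quantified[upper_rank - n_censored]
    return 0.5 * (lower + upper)


# Hypothetical samples: the median does not depend on the censored values
# as long as more than half of the results are quantified.
print(censored_median([None, None, 1.2, 1.5, 2.0, 2.4, 3.1]))
print(censored_median([None, None, None, 1.5, 2.0, 2.4]))
```

                        The second call returns None because only half of the observations are quantified, so one of
                       the two middle ranks is a censored value.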
                        The median is the middle value in a ranked data set if the number of observations is odd. If the number
                       of observations is even, the two middle values are averaged to estimate the median. If more than half


                       © 2002 By CRC Press LLC