Page 132 - Statistics for Environmental Engineers
15
Analyzing Censored Data
KEY WORDS censored data, delta-lognormal distribution, median, limit of detection, trimmed mean,
Winsorized mean, probability plot, rankit, regression, order statistics.
Many important environmental problems focus on chemicals that are expected to exist at very low
concentrations, or to be absent. Under these conditions, a set of data may include some observations that are
reported as “not detected” or “below the method limit of detection (MDL).” Such a data set is said to be censored.
Censored data are, in essence, missing values. Missing values in data records are common and they
are not always a serious problem. If 50 specimens were collected and five of them, selected at random,
were damaged or lost, we could do the analysis as though there were only 45 observations. If a few
values are missing at random intervals from a time series, they can be filled in without seriously distorting
the pattern of the series. The difficulty with censored data is that missing values are not selected at
random. They are all missing at one end of the distribution. We cannot go ahead as if they never existed
because this would bias the final results.
The odd feature of censored water quality data is that the censored values were not always missing.
Some numerical value was measured, but the analytical chemist determined that the value was below
the method limit of detection (MDL) and reported <MDL instead of the number. A better practice is to
report all values along with a statement of their precision and let the data analyst decide what weight
the very low values should carry in the final interpretation. Some laboratories do this, but there are
historical data records that have been censored and there are new censored data being produced. Methods
are needed to interpret these.
Unfortunately, there is no generally accepted scheme for replacing the censored observations with
some arbitrary values. Replacing censored observations with zero or 0.5 MDL gives estimates of the
mean that are biased low and estimates of the variance that are biased high. Replacing the censored
values with the MDL, or omitting the censored observations, gives estimates of the mean that are biased
high and estimates of the variance that are biased low. The bias of both the mean and the variance
increases as the fraction of observations censored increases or as the MDL increases (Berthouex and Brown, 1994).
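The direction of these biases can be checked with a small simulation. The sketch below is illustrative only: the normal population (mean 10, standard deviation 2), the MDL of 8.5, the sample size, and the function name `substitution_estimates` are all assumptions made for the example, not values from the text.

```python
import random
import statistics


def substitution_estimates(values, mdl):
    """Return {rule: (mean, variance)} for four ways of handling values < mdl."""
    treated = {
        "zero": [v if v >= mdl else 0.0 for v in values],
        "half_mdl": [v if v >= mdl else 0.5 * mdl for v in values],
        "mdl": [v if v >= mdl else mdl for v in values],
        "omit": [v for v in values if v >= mdl],
    }
    return {rule: (statistics.mean(x), statistics.variance(x))
            for rule, x in treated.items()}


# Hypothetical "true" measurements (assumed population, not from the text).
rng = random.Random(1)
true_values = [rng.gauss(10.0, 2.0) for _ in range(2000)]
mdl = 8.5  # assumed detection limit; roughly a fifth of the values fall below it

true_mean = statistics.mean(true_values)
true_var = statistics.variance(true_values)
estimates = substitution_estimates(true_values, mdl)

print(f"uncensored  mean={true_mean:.2f}  variance={true_var:.2f}")
for rule, (m, v) in estimates.items():
    print(f"{rule:<10}  mean={m:.2f}  variance={v:.2f}")
```

Under these assumptions, the "zero" and "half_mdl" rows should show a mean below and a variance above the uncensored values, and the "mdl" and "omit" rows the reverse.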
The median, trimmed mean, and Winsorized mean are three unbiased estimates of the mean for normal
or other symmetrical distributions. They are insensitive to information from the extremes of the
distribution and can be used when the extent of censoring is moderate (i.e., not more than 15 to 25%).
Graphical interpretation with probability plots is useful, especially when the degree of censoring is high.
Cohen’s maximum likelihood method for estimating the mean and variance is widely used when
censoring is 25% or less.
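As a rough illustration of why the trimmed and Winsorized means tolerate moderate censoring, the sketch below uses the common definitions: the trimmed mean discards the k smallest and k largest ranked values, and the Winsorized mean replaces them with the nearest retained values. The function names, the data, and the choice of k are assumptions made for the example. The censored observations are stored at the MDL only as placeholders; as long as k is at least the number of censored values, the placeholders never enter either estimate.

```python
def trimmed_mean(ranked, k):
    """Mean after discarding the k smallest and k largest ranked values."""
    if 2 * k >= len(ranked):
        raise ValueError("k is too large for this sample size")
    kept = ranked[k:len(ranked) - k]
    return sum(kept) / len(kept)


def winsorized_mean(ranked, k):
    """Mean after replacing the k values at each end with the nearest kept value."""
    if 2 * k >= len(ranked):
        raise ValueError("k is too large for this sample size")
    kept = ranked[k:len(ranked) - k]
    filled = [kept[0]] * k + kept + [kept[-1]] * k
    return sum(filled) / len(filled)


# Hypothetical data: 8 quantified values and 2 values reported as <MDL.
mdl = 2.0
quantified = [2.1, 2.4, 2.6, 2.9, 3.0, 3.3, 3.5, 4.0]
n_censored = 2

# Censored values occupy the lowest ranks; the MDL is only a placeholder.
ranked = [mdl] * n_censored + sorted(quantified)
k = 2  # must be at least n_censored so the placeholders are never used

t_mean = trimmed_mean(ranked, k)
w_mean = winsorized_mean(ranked, k)
print(f"trimmed mean = {t_mean:.3f}, Winsorized mean = {w_mean:.3f}")
```

Because the same number of values is handled at each end, both estimates rely on the symmetry of the underlying distribution.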
The Median
The median is an unbiased estimate of the mean of any symmetric distribution (e.g., the normal
distribution). The median is unaffected by the magnitude of observations on the tails of the distribution. It is
also unaffected by censoring so long as more than half of the observations have been quantified.
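A minimal sketch of this idea follows, assuming a single MDL so that every censored observation ranks below every quantified one. The function name `censored_median` and the data are hypothetical. It applies the usual rule (the middle ranked value for an odd count, the average of the two middle values for an even count) and returns `None` when a middle rank falls among the censored observations.

```python
def censored_median(quantified, n_censored):
    """Median of a sample in which n_censored values were reported as <MDL.

    Assumes every censored value lies below every quantified value.
    Returns None if the median cannot be determined from the quantified values.
    """
    n = len(quantified) + n_censored
    if n == 0:
        raise ValueError("no observations")
    # Censored values take the lowest ranks; None marks an unknown magnitude.
    ranked = [None] * n_censored + sorted(quantified)
    if n % 2 == 1:
        return ranked[n // 2]
    low, high = ranked[n // 2 - 1], ranked[n // 2]
    if low is None:
        return None
    return (low + high) / 2


# Hypothetical data: six quantified values.
quantified = [1.2, 0.9, 2.1, 1.5, 1.1, 1.8]

median_odd = censored_median(quantified, 3)    # n = 9, median is a quantified value
median_even = censored_median(quantified, 4)   # n = 10, two middle values averaged
median_lost = censored_median(quantified, 6)   # half censored: not determinable
print(median_odd, median_even, median_lost)
```

The result depends only on how many values were censored, not on what their unreported magnitudes were.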
The median is the middle value in a ranked data set if the number of observations is odd. If the number
of observations is even, the two middle values are averaged to estimate the median. If more than half
© 2002 By CRC Press LLC