Aberrant values. Values that stand out from the general trend are fairly common. They may occur
because of gross errors in sampling or measurement. They may be mistakes in data recording. If we think
only in these terms, it becomes too tempting to discount or throw out such values. However, rejecting
any value out of hand may lead to serious errors. Some early observers of stratospheric ozone concentrations failed to detect the hole in the ozone layer because their computer had been programmed to screen
incoming data for “outliers.” The values that defined the hole in the ozone layer were disregarded. This
is a reminder that rogue values may be real. Indeed, they may contain the most important information.
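As a rough illustration of how automatic screening can hide a real signal, the short Python sketch below (simulated numbers chosen only for illustration, not actual ozone records) applies a routine screen of plus or minus three standard deviations and silently throws away a genuine, sustained drop.

    # A screen tuned to past data (mean +/- 3 standard deviations) quietly
    # rejects a real, sustained drop. All values are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    baseline = rng.normal(300, 10, size=200)      # typical readings
    drop = rng.normal(180, 10, size=20)           # a genuine shift, not a mistake
    series = np.concatenate([baseline, drop])

    mean, sd = baseline.mean(), baseline.std()    # screen calibrated to history
    kept = series[np.abs(series - mean) < 3 * sd]

    print("observations discarded:", len(series) - len(kept))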
Censored data. Great effort and expense are invested in measurements of toxic and hazardous
substances that should be absent or present only in trace amounts. The analyst handles many
specimens for which the concentration is reported as “not detected” or “below the analytical method
detection limit.” This method of reporting censors the data at the limit of detection and condemns all
lower values to be qualitative. This manipulation of the data creates severe problems for the data analyst
and the person who needs to use the data to make decisions.
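A small simulation suggests the kind of trouble this creates. In the Python sketch below (simulated lognormal concentrations and an assumed detection limit, not data from any real method), the estimated mean depends on whichever substitution rule the analyst happens to choose for the nondetects.

    # Simulated "true" concentrations with an assumed detection limit DL.
    # Once values below DL are reported only as "not detected", the estimated
    # mean depends on the substitution rule chosen for the nondetects.
    import numpy as np

    rng = np.random.default_rng(2)
    true_conc = rng.lognormal(mean=0.0, sigma=1.0, size=500)
    DL = 1.0
    detected = true_conc >= DL

    print("censored observations:", int(np.sum(~detected)), "of", true_conc.size)
    print(f"true mean: {true_conc.mean():.2f}")
    for label, sub in (("zero", 0.0), ("DL/2", DL / 2), ("DL", DL)):
        est = np.where(detected, true_conc, sub)
        print(f"mean with nondetects set to {label}: {est.mean():.2f}")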
Large amounts of data (which are often observational data rather than data from designed experiments). Every treatment plant, river basin authority, and environmental control agency has accumulated a mass of multivariate data in filing cabinets or computer databases. Most of this is happenstance data: it was collected for one purpose and is later considered for another. Happenstance data are
often ill suited for model building. They may be ill suited for detecting trends over time or for testing
any hypothesis about system behavior because (1) the record is not consistent and comparable from
period to period, (2) all variables that affect the system have not been observed, and (3) the range of
variables has been restricted by the system’s operation. In short, happenstance data often contain
surprisingly little information. No amount of analysis can extract information that does not exist.
Large measurement errors. Many biological and chemical measurements have large measurement
errors, despite the usual care that is taken with instrument calibration, reagent preparation, and personnel
training. There are efficient statistical methods to deal with random errors. Replicate measurements
can be used to estimate the random variation, averaging can reduce its effect, and other methods can
compare the random variation with possible real changes in a system. Systematic errors (bias) cannot
be removed or reduced by averaging.
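The distinction can be seen in a brief simulation. In the Python sketch below (arbitrary illustrative numbers), averaging replicate measurements shrinks the random error roughly as one over the square root of the number of replicates, while a fixed calibration bias passes straight through to the average.

    # Averaging n replicates shrinks random error roughly as 1/sqrt(n),
    # but a systematic (calibration) bias survives averaging untouched.
    # All numbers are arbitrary illustrative values.
    import numpy as np

    rng = np.random.default_rng(3)
    true_value, bias, sigma = 50.0, 2.0, 5.0

    for n in (1, 4, 16, 64):
        replicates = true_value + bias + rng.normal(0, sigma, size=(10_000, n))
        averages = replicates.mean(axis=1)
        print(f"n={n:2d}  spread of averages = {averages.std():.2f}  "
              f"mean error = {averages.mean() - true_value:.2f}")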
Lurking variables. Sometimes important variables are not measured, for a variety of reasons. Such
variables are called lurking variables. The problems they can cause are discussed by Box (1966) and
Joiner (1981). A related problem occurs when a truly influential variable is carefully kept within a narrow range, with the result that the variable appears insignificant when it is used in a regression model.
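A small simulated regression illustrates the restricted-range problem. In the Python sketch below (invented temperature and response values), the same true effect is estimated precisely when the variable ranges widely, but its standard error is roughly an order of magnitude larger when operation confines the variable to a narrow band, so the real effect can easily look insignificant.

    # A response truly driven by temperature (slope 2.0), fitted by least
    # squares. When operation restricts temperature to a narrow band, the
    # slope is estimated far less precisely. All numbers are invented.
    import numpy as np

    rng = np.random.default_rng(4)

    def slope_and_se(low, high, n=50):
        temp = rng.uniform(low, high, size=n)
        y = 2.0 * temp + rng.normal(0, 5.0, size=n)
        x = temp - temp.mean()
        slope = (x @ (y - y.mean())) / (x @ x)
        resid = y - y.mean() - slope * x
        se = np.sqrt((resid @ resid) / (n - 2) / (x @ x))
        return slope, se

    for label, low, high in (("wide range (10-30 C)  ", 10.0, 30.0),
                             ("narrow range (19-21 C)", 19.0, 21.0)):
        slope, se = slope_and_se(low, high)
        print(f"{label}: slope = {slope:.2f}, standard error = {se:.2f}")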
Nonconstant variance. The error associated with a measurement is often nearly proportional to the magnitude of the measured value rather than approximately constant over the range of measurement. Many measurement procedures and instruments introduce this property.
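The Python sketch below (a hypothetical instrument with a constant 5% relative error) shows how the absolute spread of the readings grows in proportion to the level being measured.

    # A hypothetical instrument with a constant 5% relative error: the
    # absolute spread of the readings grows with the level being measured.
    import numpy as np

    rng = np.random.default_rng(5)
    for level in (1.0, 10.0, 100.0):
        readings = level * (1 + rng.normal(0, 0.05, size=1000))
        print(f"true level {level:6.1f}: standard deviation of readings = {readings.std():.2f}")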
Nonnormal distributions. We are strongly conditioned to think of data as being symmetrically distributed
about their average value in the bell shape of the normal distribution. Environmental data seldom have
this distribution. A common asymmetric distribution has a long tail toward high values.
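The Python sketch below (simulated lognormal concentrations, a shape often used to describe such data) shows the asymmetry: the mean sits well above the median because of the long right tail.

    # Simulated lognormal concentrations: the long tail toward high values
    # pulls the mean well above the median.
    import numpy as np

    rng = np.random.default_rng(6)
    conc = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)

    print(f"median          = {np.median(conc):.2f}")
    print(f"mean            = {conc.mean():.2f}")
    print(f"95th percentile = {np.percentile(conc, 95):.2f}")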
Serial correlation. Many environmental data occur as a sequence of measurements taken over time
or space. The order of the data is critical. In such data, it is common that the adjacent values are not
statistically independent of each other because the natural continuity over time (or space) tends to make
neighboring values more alike than randomly selected values. This property, called serial correlation,
violates the assumptions on which many statistical procedures are based. Even low levels of serial
correlation can distort estimation and hypothesis testing procedures.
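The Python sketch below (a simulated first-order autoregressive series, not real monitoring data) shows one consequence: with strong positive lag-1 autocorrelation, the usual s/sqrt(n) formula understates the uncertainty of the mean by roughly a factor of two.

    # A simulated first-order autoregressive series: neighboring values are
    # alike, and the usual s/sqrt(n) formula understates the uncertainty of
    # the mean. The AR(1) adjustment factor is sqrt((1 + r1)/(1 - r1)).
    import numpy as np

    rng = np.random.default_rng(7)
    n, phi = 500, 0.7
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    xc = x - x.mean()
    r1 = (xc[:-1] @ xc[1:]) / (xc @ xc)              # lag-1 autocorrelation
    naive_se = x.std(ddof=1) / np.sqrt(n)            # assumes independent data
    adjusted_se = naive_se * np.sqrt((1 + r1) / (1 - r1))

    print(f"lag-1 autocorrelation: {r1:.2f}")
    print(f"naive standard error of the mean:    {naive_se:.3f}")
    print(f"adjusted standard error of the mean: {adjusted_se:.3f}")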
Complex cause-and-effect relationships. The systems of interest — the real systems in the field — are
affected by dozens of variables, including many that cannot be controlled, some that cannot be measured
accurately, and probably some that are unidentified. Even if the known variables were all controlled, as
we try to do in the laboratory, the physics, chemistry, and biochemistry of the system are complicated
and difficult to decipher. Even a system that is driven almost entirely by inorganic chemical reactions
can be difficult to model (for example, because of chemical complexation and amorphous solids formation). The situation has been described by Box and Luceno (1997): “All models are wrong but some are useful.” Our ambition usually falls short of trying to discover all causes and effects. We are happy if we
can find a useful model.