Page 55 - Statistics for Environmental Engineers
5
Seeing the Shape of a Distribution
KEY WORDS dot diagram, histogram, probability distribution, cumulative probability distribution,
frequency diagram.
The data in a sample have some frequency distribution, perhaps symmetrical or perhaps skewed. The
statistics (mean, variance, etc.) computed from these data also have some distribution. For example, if the
problem is to establish a 95% confidence interval on the mean, it matters little whether the sample is normally
distributed, because the distribution of the mean tends to be normal regardless of the sample’s distribution,
provided the sample is not too small.
In contrast, if the problem is to estimate how frequently a certain value will be exceeded, it is essential to
base the estimate on the correct distribution of the sample. This chapter is about the shape of the distribution
of the data in the sample and not the distribution of statistics computed from the sample.
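The contrast between the distribution of the data and the distribution of a statistic can be illustrated with a small simulation. The sketch below assumes a lognormal population (chosen only because it is positively skewed; it is not the survey data) and an arbitrary sample size of 30. It compares the skewness of individual observations with the skewness of sample means.

```python
import random
import statistics

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical skewed population: lognormal with mu = 0, sigma = 0.5 (assumed)
observations = [rng.lognormvariate(0.0, 0.5) for _ in range(20000)]

# Means of 2000 independent samples of size 30 from the same population
sample_means = [
    statistics.fmean([rng.lognormvariate(0.0, 0.5) for _ in range(30)])
    for _ in range(2000)
]

skew_obs = skewness(observations)
skew_means = skewness(sample_means)
print(f"skewness of individual observations: {skew_obs:.2f}")
print(f"skewness of sample means (n = 30):   {skew_means:.2f}")
```

The sample means should be much closer to symmetric than the observations themselves. This is why a confidence interval on the mean is tolerant of skewness, while an exceedance estimate, which depends on the tail of the data, is not.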
Many times the first analysis done on a set of data is to compute the mean and standard deviation. These
two statistics fully characterize a normal distribution. They do not fully describe other distributions. We
should not assume that environmental data will be normally distributed. Experience shows that stream quality
data, wastewater treatment plant influent and effluent data, soil properties, and air quality data typically do
not have normal distributions. They are more likely to have a long tail skewed toward high values (positive
skewness). Fortunately, one need not assume the distribution. It can be discovered from the data.
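A short sketch shows why the mean and standard deviation alone do not describe a skewed distribution. The two data sets below are invented for illustration and rescaled to share the same mean and standard deviation (100 and 20, arbitrary values). Their shapes still differ, as the skewness coefficient and the largest value reveal.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def rescale(values, mean, sd):
    """Linearly rescale values to a target mean and population standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [mean + sd * (v - m) / s for v in values]

# Hypothetical values, invented for illustration only
symmetric = rescale([1, 2, 3, 4, 5, 6, 7, 8, 9], mean=100.0, sd=20.0)
skewed = rescale([1, 1, 1, 2, 2, 3, 4, 7, 15], mean=100.0, sd=20.0)

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(
        f"{name:>9}: mean={statistics.fmean(data):.1f} "
        f"sd={statistics.pstdev(data):.1f} "
        f"skewness={skewness(data):.2f} max={max(data):.1f}"
    )
```

Both lines report the same mean and standard deviation, yet the skewed set has a much larger maximum. A summary limited to those two statistics would hide exactly the feature that matters for high-value exceedances.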
Simple plots help reveal the sample’s distribution. Some of these plots have already been discussed
in Chapters 2 and 3. Dot diagrams are particularly useful. These simple plots have been overlooked and
underused. Environmental engineering references are likely to recommend, by example if not explicitly,
the construction of a probability plot (also known as the cumulative frequency plot). Probability
plots can be useful. Their construction and interpretation, and the ways in which such plots can be
misused, will be discussed.
Case Study: Industrial Waste Survey Data Analysis
The BOD (5-day) data given in Table 5.1 were obtained from an industrial wastewater survey (U.S. EPA,
1973). There are 99 observations, each measured on a 4-hr composite sample, giving six observations
daily for 16 days, plus three observations on the 17th day. The survey was undertaken to estimate the
average BOD and to estimate the concentration that is exceeded some small fraction of the time (for
example, 10%). This information is needed to design a treatment process. The pattern of variation also
needs to be seen because it will influence the feasibility of using an equalization process to reduce the
variation in BOD loading. The data may have other interesting properties, so the data presentation should
be complete, clear, and not open to misinterpretation.
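The two survey objectives can be sketched numerically. Table 5.1 is not reproduced here, so the BOD values below are hypothetical and serve only to show the calculation. The 90th percentile is taken as an empirical estimate of the concentration exceeded 10% of the time, using the default interpolation rule of Python's `statistics.quantiles` (one of several reasonable rules, and not necessarily the one the authors use).

```python
import statistics

# Sampling design from the text: 4-hr composites give 24 / 4 = 6 per day
per_day = 24 // 4
n_obs = per_day * 16 + 3  # 16 full days plus 3 samples on day 17
print(n_obs)  # 99

# Hypothetical BOD values (mg/L), invented for illustration; NOT Table 5.1
bod = [310, 420, 455, 510, 560, 600, 640, 690, 700, 735,
       760, 800, 850, 910, 980, 1050, 1120, 1180, 640, 580]

mean_bod = statistics.fmean(bod)

# Nine cut points split the data into 10 equal-probability groups;
# the last cut point is the 90th percentile.
p90 = statistics.quantiles(bod, n=10)[-1]

# Fraction of observations above the estimated 90th percentile
frac_exceeding = sum(v > p90 for v in bod) / len(bod)

print(f"mean BOD = {mean_bod:.0f} mg/L")
print(f"concentration exceeded about 10% of the time = {p90:.0f} mg/L")
print(f"fraction of observations above it = {frac_exceeding:.2f}")
```

An empirical percentile makes no assumption about the shape of the distribution, which is consistent with the chapter's advice to discover the distribution from the data rather than assume it.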
Dot Diagrams
Figure 5.1 is a time series plot of the data. The concentration fluctuates rapidly with more or less equal
variation above and below the average, which is 687 mg/L. The range is from 207 to 1185 mg/L. The
BOD may change by nearly 1000 mg/L from one sampling interval to the next. It is not clear whether the ups
and downs are random or are part of some cyclic pattern. There is little else to be seen from this plot.
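The numerical summaries quoted for the time series (average, range, and change between successive sampling intervals) are simple to compute. The sketch below uses a short hypothetical series in time order, not the survey data, so its results will not match the values reported for Figure 5.1.

```python
import statistics

# Hypothetical 4-hr composite BOD values (mg/L) in time order; NOT the survey data
bod = [650, 420, 980, 710, 300, 1150, 560, 840, 690, 510, 905, 330]

mean_bod = statistics.fmean(bod)
low, high = min(bod), max(bod)

# Absolute change between consecutive sampling intervals
changes = [abs(b - a) for a, b in zip(bod, bod[1:])]
largest_change = max(changes)

print(f"mean = {mean_bod:.0f} mg/L, range = {low} to {high} mg/L")
print(f"largest change between successive samples = {largest_change} mg/L")
```

The largest successive change can never exceed the difference between the maximum and the minimum, which gives a quick consistency check on any reported values.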
© 2002 By CRC Press LLC