Page 61 - Statistics for Environmental Engineers
Comments
We are almost always interested in knowing the shape of a sample’s distribution. Often it is important
to know whether a set of data is distributed symmetrically about a central value, or whether there is a
tail of data toward a high or a low value. It may be important to know what fraction of time a critical
value is exceeded.
Dot plots and probability plots are useful graphical tools for seeing the shape of a distribution. To
avoid misinterpreting probability plots, use them only in conjunction with other plots. Make dot diagrams
and, if the data are sequential in time, a time series plot. Sometimes these graphs provide all the important
information and the probability plot is unnecessary.
Probability plots are convenient for estimating percentile values, especially the median (50th percentile)
and extreme values. It is not necessary for the probability plot to be a straight line to do this. If it
is straight, draw a straight line. But if it is not straight, draw a smooth curve through the plotted points
and go ahead with the estimation.
Do not use probability plots to estimate the mean and standard deviation except in the very special
case when the data give a linear plot on normal probability paper. This special case is common in
textbooks, but rare with real environmental data. If the data plot as a straight line on log-probability
paper, the 50th percentile value is not the mean (it is the geometric mean) and there is no distance that
can be measured on the plot to estimate the standard deviation.
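A small numerical illustration may help show why the 50th percentile on log-probability paper should not be read as the mean. The sketch below uses hypothetical values chosen to be symmetric on a log scale. The variable names are illustrative only.

```python
from statistics import geometric_mean, mean, median

# Hypothetical values, evenly spaced on a log scale (illustration only).
values = [10, 100, 1000, 10000, 100000]

arithmetic = mean(values)           # pulled upward by the largest values
geometric = geometric_mean(values)  # antilog of the mean of the logarithms
middle = median(values)             # the 50th percentile

print(arithmetic, geometric, middle)
```

For these values the geometric mean coincides with the median (1000), while the arithmetic mean (22,222) is more than twenty times larger.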
Probability plots may be useful in discovering the distribution of the data in a sample. Sometimes the
analysis is not clear-cut. Because of random sampling variation, the curve can have a substantial amount
of “wiggle” when the data actually are normally distributed. When the number of observations approaches
50, the shape of the probability distribution becomes much clearer than when the sample is small
(for example, 20 observations). Hahn and Shapiro (1967) point out that:
1. The variance of points in the tails (extreme low or high plotted values) will be larger than
that of points at the center of the distribution. Thus, the relative linearity of the plot near the
tails of the distribution will often seem poorer than at the center even if the correct model
for the probability density distribution has been chosen.
2. The plotted points are ordered and hence are not independent. Thus, we should not expect
them to be randomly scattered about a line. For example, the points immediately following
a point above the line are also likely to be above the line. Even if the chosen model is correct,
the plot may consist of a series of successive points (known as runs) above and below the line.
3. A model can never be proven to be adequate on the basis of sample data. Thus, the probability
plot of a small sample taken from a near-normal distribution will frequently not differ appreciably
from that of a sample from a normal distribution.
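The first point in the list above can be checked with a small simulation. The sketch below is an illustration under assumed settings (samples of 20 from a standard normal distribution, 2000 repetitions, a fixed random seed), not a reproduction of any calculation by Hahn and Shapiro. It draws many samples, sorts each one, and computes the variance of the value at each rank across the repetitions.

```python
import random
from statistics import variance


def order_statistic_variances(n=20, trials=2000, seed=1):
    """Variance of each ranked value across repeated normal samples of size n."""
    rng = random.Random(seed)
    by_rank = [[] for _ in range(n)]
    for _ in range(trials):
        sample = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        for rank, value in enumerate(sample):
            by_rank[rank].append(value)
    return [variance(column) for column in by_rank]


rank_variances = order_statistic_variances()
print("smallest value:", rank_variances[0])
print("middle value:  ", rank_variances[len(rank_variances) // 2])
print("largest value: ", rank_variances[-1])
```

The smallest and largest ranked values should vary considerably more from sample to sample than the values near the middle, which is why the ends of a probability plot often look less linear even when the normal model is correct.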
If the data have positive skew, it is often convenient to use graph paper that has a log scale on one
axis and a normal probability scale on the other axis. If the logarithms of the data are normally distributed,
this kind of graph paper will produce a straight-line probability plot. The log scale may provide a
convenient scaling for the graph even if it does not produce a straight-line plot; for example, when the
data are bacterial counts that range from 10 to 100,000.
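For readers working in software rather than on graph paper, the coordinates of a log-probability plot can be computed directly. The following is a minimal sketch, assuming the plotting position p = (i - 0.5)/n and using hypothetical counts. The function name is illustrative only. Each point pairs the normal quantile of the plotting position with the base-10 logarithm of the ranked value.

```python
import math
from statistics import NormalDist


def log_probability_coordinates(data):
    """Return (z, log10(x)) pairs for a log-probability plot.

    z is the standard normal quantile of the assumed plotting position
    p_i = (i - 0.5) / n for the i-th smallest value.
    """
    xs = sorted(data)
    if xs and xs[0] <= 0:
        raise ValueError("a log scale requires strictly positive data")
    n = len(xs)
    nd = NormalDist()
    return [
        (nd.inv_cdf((i - 0.5) / n), math.log10(x))
        for i, x in enumerate(xs, start=1)
    ]


# Hypothetical bacterial counts, for illustration only.
counts = [10, 40, 150, 300, 900, 2500, 7000, 20000, 45000, 100000]
for z, log_count in log_probability_coordinates(counts):
    print(f"{z:6.2f}  {log_count:5.2f}")
```

If the logarithms of the data are normally distributed, these points should fall near a straight line. If they do not, the log scale still spreads the values evenly enough to make the plot readable.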
References
Hahn, G. J. and S. S. Shapiro (1967). Statistical Models in Engineering, New York, John Wiley.
Hirsch, R. M. and J. D. Stedinger (1987). “Plotting Positions for Historical Floods and Their Precision,” Water
Resources Research, 23(4), 715–727.
Mage, D. T. (1982). “An Objective Graphical Method for Testing Normal Distributional Assumptions Using
Probability Plots,” Am. Statistician, 36, 116–120.
© 2002 By CRC Press LLC