Page 42 - Statistics and Data Analysis in Geology
Elementary Statistics
If observations with certain characteristics are systematically excluded from
the sample, deliberately or inadvertently, the sample is said to be biased. Suppose,
for example, we are interested in the porosity of a particular sandstone unit. If
we exclude all loose and crumbly rocks from our sample because their porosity is
difficult to measure, we will alter the results of the study. It is likely that the range
of porosities will be truncated at the high end, biasing the sample toward low values
and giving an erroneously low estimate of the variation in porosity within the unit.
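The effect of this kind of exclusion can be sketched with a small simulation. The population below is invented for illustration: the normal distribution, its mean of 20% and standard deviation of 5%, and the 25% cutoff above which rocks are assumed too crumbly to measure are all assumptions, not values from any real sandstone unit.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population of porosity values (percent). The distribution and
# its parameters are assumptions made purely for illustration.
population = [random.gauss(20.0, 5.0) for _ in range(10_000)]

# Assumed cutoff: rocks more porous than this are excluded from the sample.
CUTOFF = 25.0
biased = [p for p in population if p <= CUTOFF]

pop_mean = statistics.fmean(population)
biased_mean = statistics.fmean(biased)
pop_sd = statistics.pstdev(population)
biased_sd = statistics.pstdev(biased)

print(f"mean porosity:      all = {pop_mean:.2f}   biased = {biased_mean:.2f}")
print(f"standard deviation: all = {pop_sd:.2f}   biased = {biased_sd:.2f}")
```

Under these assumptions, both the mean and the standard deviation of the truncated set come out lower than those of the full population, which is the pattern described in the text.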
Samples should be drawn from populations in a random manner. This means
that each item in the population has an equal opportunity to be included in the
sample. A random sample will be unbiased, and as the sample size is increased,
will provide an increasingly refined picture of the nature of the population. Unfor-
tunately, obtaining a truly random sample may be impractical, as in the situation of
sampling a geologic unit that is partially buried. Samples within the unit at depth
do not have the same opportunity of being chosen as samples at outcrops. The
problems of sampling in such circumstances are complex; some of the references
at the end of this chapter discuss the effects of various sampling schemes and the
relative merits of different sampling designs. However, many geologic problems
involve the analysis of data collected without prior design. The interpretation of
subsurface structure from drill-hole data is a prominent example.
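When every item in the population can actually be reached, simple random sampling is straightforward. The following is a minimal sketch, assuming a hypothetical population of 10,000 porosity values; the helper name sample_means and all numerical values are illustrative only. It draws samples of increasing size, each item having an equal chance of selection, and reports the sample mean for each size.

```python
import random
import statistics

def sample_means(population, sizes, seed=0):
    """Draw one simple random sample (without replacement) for each size
    and return a dict mapping sample size to sample mean."""
    rng = random.Random(seed)
    return {n: statistics.fmean(rng.sample(population, n)) for n in sizes}

# Hypothetical population; the distribution and parameters are assumptions.
rng_pop = random.Random(1)
population = [rng_pop.gauss(20.0, 5.0) for _ in range(10_000)]

print(f"population mean: {statistics.fmean(population):.2f}")
for n, mean in sample_means(population, [10, 100, 1000]).items():
    print(f"n = {n:4d}   sample mean = {mean:.2f}")
```

Larger samples tend to give means closer to the population mean, although any single draw may depart from that trend by chance.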
Statistics
Distributions have certain characteristics, such as their midpoint, measures indicating
the amount of "spread," and measures of symmetry of the distribution. These
characteristics are known as parameters if they describe populations, and statistics
if they refer to samples. Statistics may be used to estimate parameters of parent
populations and to test hypotheses about populations.
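As a brief illustration, the statistics of a small sample might be computed as follows. The porosity values are hypothetical, the function name summarize is an invention for this sketch, and the skewness shown is the simple moment coefficient, which is only one of several conventions in use.

```python
import statistics

def summarize(sample):
    """Return statistics describing the midpoint, spread, and symmetry of a
    sample. Assumes the values are not all identical (nonzero spread)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Second and third central moments, used for the moment skewness.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    return {
        "mean": mean,                        # midpoint
        "median": statistics.median(sample), # alternative midpoint
        "stdev": statistics.stdev(sample),   # spread (n - 1 denominator)
        "skewness": m3 / m2 ** 1.5,          # symmetry; 0 if symmetric
    }

# Hypothetical porosity measurements (percent), for illustration only.
porosity = [12.1, 14.8, 15.3, 17.0, 18.2, 19.5, 21.7, 26.4]
for name, value in summarize(porosity).items():
    print(f"{name:9s} {value:8.3f}")
```

Each of these numbers is a statistic because it is computed from a sample; it serves as an estimate of the corresponding parameter of the parent population.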
Although summary statistics are important, sometimes we can learn more by
examining the distribution of the observations as shown on different plots and
graphs. A familiar form of display is the histogram, a bar chart in which a con-
tinuous variable is divided into discrete categories and the number or proportion
of observations that fall into each category is represented by the areas of the
corresponding bars. (As we have already seen, histograms are useful for showing
discrete distributions, but now we are interested in their application to continuous
variables.) Usually the limits of the categories are chosen so that all of the histogram
intervals have the same width; the heights of the bars are then also proportional to
the numbers of observations within the categories represented by the bars. If the
vertical scale on the bar chart reads in number of observations, the graphic is called
a frequency histogram. If the number of observations in each category is divided
by the total number of observations, the scale reads in percent and the bar chart is
a relative frequency histogram. Since a histogram covers the entire range of obser-
vations, the sum of the areas of all the bars will represent either the total number
of observations or 100%. If the observations have been selected in an unbiased,
representative manner, the sample histogram can be considered an approximation
of the underlying probability distribution.
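The construction just described can be sketched in a few lines. This is a minimal illustration, not a standard routine: the function name histogram, its arguments, and the data values are all assumptions. Categories have equal width, each is closed on the left and open on the right, and the last category also includes its upper limit so that the whole range is covered.

```python
def histogram(data, start, width, n_bins):
    """Return (counts, relative_frequencies) for equal-width categories
    beginning at `start`. Values outside the covered range raise ValueError."""
    upper = start + width * n_bins
    counts = [0] * n_bins
    for x in data:
        if not start <= x <= upper:
            raise ValueError(f"{x} lies outside the range {start} to {upper}")
        # Category index; a value equal to the upper limit goes in the last one.
        i = min(int((x - start) // width), n_bins - 1)
        counts[i] += 1
    total = len(data)
    relative = [c / total for c in counts]
    return counts, relative

# Hypothetical porosity measurements (percent), for illustration only.
porosity = [10.5, 12.0, 15.0, 17.5, 19.9, 20.0, 24.0, 30.0]
counts, relative = histogram(porosity, start=10.0, width=5.0, n_bins=4)

for i, (c, r) in enumerate(zip(counts, relative)):
    lo = 10.0 + 5.0 * i
    print(f"{lo:5.1f} to {lo + 5.0:5.1f}  {'#' * c:<5s} {c}  ({r:.1%})")
```

The counts give the frequency histogram and the proportions give the relative frequency histogram; the counts sum to the total number of observations and the proportions sum to 1, or 100%.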
The appearance of a histogram is strongly affected by our choice of the number
of categories and the starting value of the first category, especially if the sample
contains only a few observations. Dividing the data into a small number of cate-
gories increases the average number in each and the histogram will be relatively