Page 42 - Statistics and Data Analysis in Geology

Elementary Statistics

                 If observations with certain characteristics are systematically excluded from
             the sample, deliberately or inadvertently, the sample is said to be biased. Suppose,
             for example, we are interested in the porosity of a particular sandstone unit. If
             we exclude all loose and crumbly rocks from our sample because their porosity is
             difficult to measure, we will alter the results of the study. It is likely that the range
             of porosities will be truncated at the high end, biasing the sample toward low values
             and giving an erroneously low estimate of the variation in porosity within the unit.
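
             The effect of excluding crumbly, high-porosity rocks can be illustrated with a
             small simulation. This is only a sketch: the porosity values are randomly
             generated rather than measured, and the cutoff of 25 percent (CUTOFF) is an
             assumed stand-in for "too crumbly to measure".

```python
import random
import statistics

random.seed(0)

# Hypothetical population of porosities (percent); the values are invented
# for illustration and clamped to the physically possible range 0-100.
population = [min(max(random.gauss(20.0, 5.0), 0.0), 100.0) for _ in range(10_000)]

# Assumed cutoff: rocks more porous than this are taken to be too crumbly to measure.
CUTOFF = 25.0

# An unbiased random sample, and the same sample after the crumbly rocks are dropped.
full_sample = random.sample(population, 500)
biased_sample = [p for p in full_sample if p <= CUTOFF]


def summarize(values):
    """Return the sample size, mean, standard deviation, and maximum."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "max": max(values),
    }


full_summary = summarize(full_sample)
biased_summary = summarize(biased_sample)

for label, s in (("random", full_summary), ("biased", biased_summary)):
    print(f"{label}: n={s['n']}  mean={s['mean']:.2f}  "
          f"stdev={s['stdev']:.2f}  max={s['max']:.2f}")
```

             In a run of this sketch the biased sample has a lower maximum, a lower mean, and
             a smaller standard deviation than the random sample from which it was derived.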
                 Samples should be drawn from populations in a random manner. This means
             that each item in the population has an equal opportunity to be included in the
             sample. A random sample will be unbiased, and as the sample size is increased,
             will provide an increasingly refined picture of the nature of the population.
             Unfortunately, obtaining a truly random sample may be impractical, as in the situation of
             sampling a geologic unit that is partially buried. Samples within the unit at depth
             do not have the same opportunity of being chosen as samples at outcrops. The
             problems of sampling in such circumstances are complex; some of the references
             at the end of this chapter discuss the effects of various sampling schemes and the
             relative merits of different sampling designs. However, many geologic problems
             involve the analysis of data collected without prior design. The interpretation of
             subsurface structure from drill-hole data is a prominent example.
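
             For the idealized case in which every item can be reached, a simple random sample
             might be drawn as sketched below. The population is invented for illustration,
             and the helper name simple_random_sample is hypothetical; it only wraps
             random.sample from the Python standard library, which selects items without
             replacement so that each has the same chance of being included.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 5,000 porosity values (percent), invented for illustration.
population = [random.uniform(5.0, 35.0) for _ in range(5_000)]
population_mean = statistics.mean(population)


def simple_random_sample(items, n):
    """Draw n items without replacement; every item is equally likely to be chosen."""
    return random.sample(items, n)


# Compare the sample mean with the population mean for increasing sample sizes.
for n in (10, 100, 1000):
    sample = simple_random_sample(population, n)
    sample_mean = statistics.mean(sample)
    error = abs(sample_mean - population_mean)
    print(f"n={n:5d}  sample mean={sample_mean:6.2f}  |error|={error:5.2f}")
```

             The error tends to shrink as n grows, although in any single run it need not
             decrease at every step.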


             Statistics

             Distributions have certain characteristics, such as their midpoint; measures
             indicating the amount of "spread"; and measures of symmetry of the distribution. These
             characteristics are known as parameters if they describe populations, and statistics
             if they refer to samples. Statistics may be used to estimate parameters of parent
             populations and to test hypotheses about populations.
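
             The distinction between parameters and statistics can be made concrete with a
             short sketch. The population here is simulated, and the particular measures
             chosen (mean for the midpoint, standard deviation for spread, and the moment
             coefficient of skewness for symmetry) are assumptions made for illustration,
             not the only possible choices.

```python
import random
import statistics

random.seed(2)

# Hypothetical right-skewed population (e.g. grain sizes), invented for illustration.
population = [random.lognormvariate(3.0, 0.4) for _ in range(20_000)]


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (one common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def describe(values, is_population):
    """Midpoint, spread, and symmetry of a population or of a sample."""
    # pstdev divides by N (population); stdev divides by n - 1 (sample).
    spread = statistics.pstdev(values) if is_population else statistics.stdev(values)
    return {
        "mean": statistics.mean(values),
        "spread": spread,
        "skewness": skewness(values),
    }


parameters = describe(population, is_population=True)          # describe the population
sample = random.sample(population, 200)
sample_statistics = describe(sample, is_population=False)      # estimate the parameters

for name in ("mean", "spread", "skewness"):
    print(f"{name:9s} parameter={parameters[name]:7.3f}  "
          f"statistic={sample_statistics[name]:7.3f}")
```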
                 Although summary statistics are important, sometimes we can learn more by
             examining the distribution of the observations as shown on different plots and
             graphs. A familiar form of display is the histogram, a bar chart in which a
             continuous variable is divided into discrete categories and the number or proportion
             of observations that fall into each category is represented by the area of the
             corresponding bar. (As we have already seen, histograms are useful for showing
             discrete distributions, but now we are interested in their application to continuous
             variables.) Usually the limits of the categories are chosen so that all of the histogram
             intervals have the same width; the heights of the bars are then also proportional to
             the numbers of observations within the categories represented by the bars. If the
             vertical scale on the bar chart reads in number of observations, the graphic is called
             a frequency histogram. If the number of observations in each category is divided
             by the total number of observations and multiplied by 100, the scale reads in percent
             and the bar chart is a relative frequency histogram. Since a histogram covers the entire
             range of observations, the sum of the areas of all the bars will represent either the
             total number of observations or 100%. If the observations have been selected in an
             unbiased, representative manner, the sample histogram can be considered an approximation
             of the underlying probability distribution.
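
             The two kinds of histogram can be tabulated with a few lines of code. The sketch
             below assumes equal-width categories; the function names, the twelve porosity
             values, and the choice of a starting value of 10 and a width of 5 are all
             hypothetical, chosen only to illustrate the bookkeeping.

```python
def frequency_histogram(values, start, width, n_bins):
    """Count the observations falling in n_bins equal-width categories.

    Category i covers start + i*width up to (but not including) start + (i+1)*width;
    a value exactly on the upper limit of the range goes in the last category.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if i == n_bins and v == start + n_bins * width:
            i = n_bins - 1
        if not 0 <= i < n_bins:
            raise ValueError(f"{v} lies outside the histogram range")
        counts[i] += 1
    return counts


def relative_frequency_histogram(counts):
    """Convert category counts to percentages of the total number of observations."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


# Hypothetical porosity measurements (percent), invented for illustration.
porosity = [12.1, 14.8, 15.2, 17.9, 18.3, 18.7, 19.4, 20.6, 21.0, 22.5, 23.3, 27.8]

counts = frequency_histogram(porosity, start=10.0, width=5.0, n_bins=4)
percents = relative_frequency_histogram(counts)

# Print a crude text histogram: one '#' per observation.
for i, (c, p) in enumerate(zip(counts, percents)):
    lower = 10.0 + 5.0 * i
    print(f"{lower:4.0f}-{lower + 5.0:<4.0f} {'#' * c:<6s} {c:2d}  {p:5.1f}%")
```

             Here the counts sum to the 12 observations and the percentages sum to 100, which
             mirrors the statement that the bars together represent either the total number of
             observations or 100%.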
                 The appearance of a histogram is strongly affected by our choice of the number
             of categories and the starting value of the first category, especially if the sample
             contains only a few observations. Dividing the data into a small number of
             categories increases the average number in each and the histogram will be relatively
