Page 69 - Statistics for Environmental Engineers









                       7  Using Transformations






                       KEY WORDS antilog, arcsin, bacterial counts, Box-Cox transformation, cadmium, confidence interval,
                       geometric mean, transformations, linearization, logarithm, nonconstant variance, plankton counts,
                       power function, reciprocal, square root, variance stabilization.

                       There is usually no scientific reason why we should insist on analyzing data in their original scale of
                       measurement. Instead of doing our analysis on y it may be more appropriate to look at log(y), √y, 1/y,
                       or some other function of y. These re-expressions of y are called transformations. Properly used
                       transformations eliminate distortions and give each observation equal power to inform.
                        Making a transformation is not cheating. It is a common scientific practice for presenting and
                       interpreting data. A pH meter reads in logarithmic units, $\mathrm{pH} = -\log_{10}[\mathrm{H}^+]$, and not in
                       hydrogen ion concentration units. The instrument makes a data transformation that we accept as natural.
                       Light absorbance is measured on a logarithmic scale by a spectrophotometer and converted to a
                       concentration with the aid of a calibration curve. The calibration curve makes a transformation that is
                       accepted without hesitation. If we are dealing with bacterial counts, N, we think just as well in terms
                       of log(N) as of N itself.
                        There are three technical reasons for sometimes doing the calculations on a transformed scale: (1) to
                       make the spread equal in different data sets (to make the variances uniform); (2) to make the distribution
                       of the residuals normal; and (3) to make the effects of treatments additive (Box et al., 1978).¹ Equal
                       variance means having equal spread at the different settings of the independent variables or in the different
                       data sets that are compared. The requirement for a normal distribution applies to the measurement errors
                       and not to the entire sample of data. Transforming the data makes it possible to satisfy these requirements
                       when they are not satisfied by the original measurements.
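
The first of these reasons, equal spread, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data are hypothetical lognormal values generated at two levels (the helper `simulate_counts`, the medians, and `sigma` are invented for this example and do not come from the text). It compares the standard deviations on the original scale and on the log scale.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible


def simulate_counts(median, n=200, sigma=0.3):
    """Hypothetical data: lognormal values whose spread grows with their level."""
    return [median * math.exp(random.gauss(0.0, sigma)) for _ in range(n)]


low = simulate_counts(median=100.0)
high = simulate_counts(median=10000.0)

# Original scale: the standard deviation grows with the level of the data.
sd_low = statistics.stdev(low)
sd_high = statistics.stdev(high)

# Log scale: the two standard deviations are about equal (both near sigma).
sd_log_low = statistics.stdev([math.log(v) for v in low])
sd_log_high = statistics.stdev([math.log(v) for v in high])

print(f"original scale: sd = {sd_low:.1f} vs {sd_high:.1f}")
print(f"log scale:      sd = {sd_log_low:.3f} vs {sd_log_high:.3f}")
```

Under these assumptions the spread differs by roughly two orders of magnitude on the original scale and is nearly the same after the log transformation. Real data need not behave this cleanly.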



                       Transformations for Linearization
                       Transformations are sometimes used to obtain a straight-line relationship between two variables. This
                       may involve, for example, using reciprocals, ratios, or logarithms. The left-hand panel of Figure 7.1 shows
                       the exponential growth of bacteria. Notice that the variance (spread) of the counts increases as the population
                       density increases. The right-hand panel shows that the data can be described by a straight line when plotted
                       on a log scale. Plotting on a log scale is equivalent to making a log transformation of the data.
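
A minimal sketch of this linearization follows. It assumes hypothetical, noise-free counts that follow N = N0 exp(kt); the values of `N0`, `k`, and `times` and the helper `fit_line` are invented for illustration and are not the data of Figure 7.1. A straight line fitted to log(N) against t recovers the growth rate as the slope and log(N0) as the intercept.

```python
import math

# Hypothetical noise-free counts following N = N0 * exp(k * t).
N0, k = 50.0, 1.2
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
counts = [N0 * math.exp(k * t) for t in times]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# The log transformation turns the exponential curve into a straight line.
log_counts = [math.log(c) for c in counts]
slope, intercept = fit_line(times, log_counts)

k_hat = slope                 # growth rate is the slope on the log scale
N0_hat = math.exp(intercept)  # back-transform the intercept with the antilog
print(f"k = {k_hat:.3f}, N0 = {N0_hat:.1f}")
```

Natural logarithms are used here so that the slope is k directly. With base-10 logarithms the line is still straight, but the slope becomes k/ln(10).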
                        The important characteristic of the original data is the nonconstant variance, not the nonlinearity.
                       This is a problem when a curve or line is fitted to the data using regression. Least squares regression
                       minimizes the sum of the squared distances between the data points and the line described by the model,
                       so points that are far from the line exert a strong effect. The result is that the precisely measured
                       points at time t = 1 will have less influence on the position of the regression line than the poorly
                       measured data at t = 3. This gives too much influence to the least reliable data. We would prefer for
                       each data point to have about the same amount of influence on the location of the line.
                       In this example, the log-transformed data have constant variance at the different population levels. Each data



                       ¹ For example, if $y = x^a z^b$, a log transformation gives $\log y = a \log x + b \log z$. Now the effects of
                       factors x and z are additive. See Box et al. (1978) for an example of how this can be useful.
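
The footnote's example can be checked numerically. The sketch below is illustrative only: the exponents `a_true` and `b_true` and the values in `xs` and `zs` are hypothetical, and the responses are generated without error. Because log y = a log x + b log z is linear in a and b, a two-term least-squares fit with no intercept on the log scale recovers both exponents.

```python
import math

# Hypothetical exponents and factor settings for y = x**a * z**b.
a_true, b_true = 2.0, -0.5
xs = [1.5, 2.0, 3.0, 4.0, 6.0]
zs = [8.0, 2.5, 5.0, 1.2, 3.0]
ys = [x ** a_true * z ** b_true for x, z in zip(xs, zs)]

# On the log scale the model is additive: log y = a log x + b log z.
u = [math.log(x) for x in xs]
v = [math.log(z) for z in zs]
w = [math.log(y) for y in ys]

# Sums needed for the normal equations of a two-term fit with no intercept.
suu = sum(ui * ui for ui in u)
svv = sum(vi * vi for vi in v)
suv = sum(ui * vi for ui, vi in zip(u, v))
suw = sum(ui * wi for ui, wi in zip(u, w))
svw = sum(vi * wi for vi, wi in zip(v, w))

# Solve the 2 x 2 normal equations by Cramer's rule.
det = suu * svv - suv ** 2
a_hat = (suw * svv - suv * svw) / det
b_hat = (suu * svw - suv * suw) / det
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

On the original scale the effect of x on y depends on the level of z. On the log scale each factor contributes its own additive term, which is what makes the linear fit possible.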

                       © 2002 By CRC Press LLC