Page 277 - Statistics for Environmental Engineers
31
Correlation
KEY WORDS BOD, COD, correlation, correlation coefficient, covariance, nonparametric correlation, Pearson product-moment correlation coefficient, R², regression, serial correlation, Spearman rank correlation coefficient, taste, chlorine.
Two variables have been measured and a plot of the data suggests that there is a linear relationship
between them. A statistic that quantifies the strength of the linear relationship between the two variables
is the correlation coefficient.
Care must be taken not to confuse correlation with causation. Correlation may, but does not necessarily, indicate causation. Observing that y increases when x increases does not mean that a change in
x causes the increase in y. Both x and y may change as a result of a change in a third variable, z.
Covariance and Correlation
A measure of the linear dependence between two variables x and y is the covariance between x and y.
The sample covariance of x and y is:
$$\mathrm{Cov}(x, y) = \frac{\sum (x_i - \eta_x)(y_i - \eta_y)}{N}$$
where η_x and η_y are the population means of the variables x and y, and N is the size of the population. If x
and y are independent, Cov(x, y) would be zero. Note that the converse is not true: finding Cov(x, y) = 0
does not mean that x and y are independent. (They might be related by a nonlinear function, for example a quadratic that is symmetric about the mean of x.)
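The covariance formula and the warning about zero covariance can be illustrated with a minimal Python sketch. The data values are hypothetical and are treated as the entire population, so the means computed from them stand in for η_x and η_y; the function name `covariance` is an illustrative choice, not something taken from the text.

```python
def covariance(x, y):
    """Covariance with divisor N, treating the supplied values as the whole population."""
    n = len(x)
    mean_x = sum(x) / n  # stands in for eta_x (assumption: data are the full population)
    mean_y = sum(y) / n  # stands in for eta_y
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


# Hypothetical data: x is symmetric about zero.
x = [-2, -1, 0, 1, 2]
y_linear = [2 * xi + 1 for xi in x]   # exact linear dependence on x
y_quadratic = [xi ** 2 for xi in x]   # exact but nonlinear dependence on x

cov_linear = covariance(x, y_linear)
cov_quadratic = covariance(x, y_quadratic)

print(cov_linear)     # 4.0
print(cov_quadratic)  # 0.0
```

In this sketch y_quadratic is completely determined by x, yet its covariance with x is zero, because the covariance responds only to the linear part of a relationship.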
The covariance is dependent on the scales chosen. Suppose that x and y are distances measured in inches.
If x is converted from inches to feet, the covariance would be divided by 12. If both x and y are converted
to feet, the covariance would be divided by 12² = 144. This makes it impossible in practice to know whether
a value of covariance is large, which would indicate a strong linear relation between two variables, or
small, which would indicate a weak association.
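The effect of the measurement scale can be checked numerically. The following is a small sketch using made-up distances in inches (the values and the helper name `covariance` are assumptions for illustration only); it recomputes the covariance after converting one, then both, variables to feet.

```python
def covariance(x, y):
    """Covariance with divisor N, treating the supplied values as the whole population."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


# Hypothetical distances measured in inches.
x_in = [12.0, 24.0, 36.0, 48.0]
y_in = [6.0, 30.0, 18.0, 42.0]

# The same distances expressed in feet.
x_ft = [v / 12 for v in x_in]
y_ft = [v / 12 for v in y_in]

cov_in_in = covariance(x_in, y_in)  # both variables in inches
cov_ft_in = covariance(x_ft, y_in)  # only x converted to feet
cov_ft_ft = covariance(x_ft, y_ft)  # both converted to feet

print(cov_in_in)  # 144.0
print(cov_ft_in)  # 12.0
print(cov_ft_ft)  # 1.0
```

The same data give covariances of 144, 12, or 1 depending only on the units, which is why the magnitude of a covariance cannot be judged on its own.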
A scaleless covariance, called the correlation coefficient ρ(x, y), or simply ρ, is obtained by dividing
the covariance by the product of the two population standard deviations, σ_x and σ_y. The possible values
of ρ range from −1 to +1. If x is independent of y, ρ would be zero. Values approaching −1 or +1 indicate
a strong linear correspondence of x with y. A positive correlation (0 < ρ ≤ 1) indicates that large values of x
are associated with large values of y. In contrast, a negative correlation (−1 ≤ ρ < 0) indicates that large
values of x are associated with small values of y.
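Written in symbols, the verbal definition above corresponds to the following expression. This is a restatement using the same ρ, σ_x, and σ_y as the text, not an additional result:

$$\rho(x, y) = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$$

The scale dependence cancels in this ratio. If x is converted from inches to feet, both Cov(x, y) and σ_x are divided by 12, so ρ is unchanged.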
The true values of the population means and standard deviations are estimated from the available data
by computing the means x̄ and ȳ. The sample correlation coefficient between x and y is:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
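As a rough illustration, the sample correlation coefficient can be computed directly from the sums of products and squares of deviations from the sample means. The sketch below uses hypothetical data, and the function name `sample_correlation` is an illustrative choice.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient r, computed from deviations about the sample means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum of cross products
    sxx = sum((xi - x_bar) ** 2 for xi in x)                        # sum of squares for x
    syy = sum((yi - y_bar) ** 2 for yi in y)                        # sum of squares for y
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = sample_correlation(x, y)
print(r)  # 0.8

# Rescaling x (for example, inches to feet) leaves r essentially unchanged.
r_rescaled = sample_correlation([v / 12 for v in x], y)
print(math.isclose(r, r_rescaled))  # True
```

For these values the sum of cross products is 8 and both sums of squares are 10, so r = 8/√(10 × 10) = 0.8.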
© 2002 By CRC Press LLC