54 Chapter 2 Getting to Know Your Data 2011/6/1 3:15 Page 54 #16
equal to $60, while 25% of items sold at branch 2 were less than or equal to $64. At the
50th percentile (marked by the median, which is also Q2), we see that 50% of items
sold at branch 1 were less than $78, while 50% of items at branch 2 were less than $85.
In general, we note that there is a shift in the distribution of branch 1 with respect to
branch 2 in that the unit prices of items sold at branch 1 tend to be lower than those at
branch 2.
Histograms
Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of
poles. Plotting histograms is a graphical method for summarizing the distribution of a
given attribute, X. If X is nominal, such as automobile model or item type, then a pole
or vertical bar is drawn for each known value of X. The height of the bar indicates the
frequency (i.e., count) of that X value. The resulting graph is more commonly known as
a bar chart.
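As a minimal sketch of the nominal case, the snippet below counts how often each value of X occurs and draws one text bar per value. The function names and the item types are hypothetical, invented only for illustration.

```python
from collections import Counter


def bar_chart_counts(values):
    """Return a mapping from each distinct nominal value to its frequency."""
    return Counter(values)


def render_bars(counts):
    """Draw one text bar per distinct value; the bar length equals the count."""
    lines = []
    for value, count in sorted(counts.items()):
        lines.append(f"{value:<10} {'#' * count} ({count})")
    return "\n".join(lines)


# Hypothetical item types (assumed data, not taken from the text).
item_types = ["laptop", "phone", "phone", "tablet", "phone", "laptop"]
counts = bar_chart_counts(item_types)
print(render_bars(counts))
```

Each distinct value gets exactly one bar, so the number of bars equals the number of known values of X.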
If X is numeric, the term histogram is preferred. The range of values for X is
partitioned into disjoint consecutive subranges. The subranges, referred to as buckets or bins,
are disjoint subsets of the data distribution for X. The range of a bucket is known as
the width. Typically, the buckets are of equal width. For example, a price attribute with
a value range of $1 to $200 (rounded up to the nearest dollar) can be partitioned into
subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with a
height that represents the total count of items observed within the subrange. Histograms
and partitioning rules are further discussed in Chapter 3 on data reduction.
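The equal-width partitioning described above can be sketched as follows. This is an illustrative helper, not a library routine: the function name is hypothetical, prices are assumed to be whole dollars in the range $1 to $200, and the sample prices are invented.

```python
def equal_width_bins(values, low, high, width):
    """Count integer values in consecutive equal-width buckets covering low..high.

    Returns a list of ((bucket_start, bucket_end), count) pairs, with both
    bucket ends inclusive, e.g. (1, 20), (21, 40), ...
    """
    n_bins = (high - low + width) // width  # ceiling of (high - low + 1) / width
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"value {v} is outside the range {low}..{high}")
        counts[(v - low) // width] += 1
    labels = [
        (low + i * width, min(low + (i + 1) * width - 1, high))
        for i in range(n_bins)
    ]
    return list(zip(labels, counts))


# Hypothetical unit prices in whole dollars (assumed data).
prices = [5, 20, 21, 40, 41, 59, 60, 200]
histogram = equal_width_bins(prices, low=1, high=200, width=20)
for (start, end), count in histogram:
    print(f"{start:>3} to {end:<3} {'#' * count}")
```

Because the buckets are disjoint and consecutive, every price falls into exactly one bucket, and the bucket counts sum to the number of observations.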
Example 2.15 Histogram. Figure 2.6 shows a histogram for the data set of Table 2.1, where buckets (or
bins) are defined by equal-width ranges representing $20 increments and the frequency
is the count of items sold.
Although histograms are widely used, they may not be as effective as the quantile
plot, q-q plot, and boxplot methods in comparing groups of univariate observations.
Scatter Plots and Data Correlation
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes. To
construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic
sense and plotted as points in the plane. Figure 2.7 shows a scatter plot for the set of data
in Table 2.1.
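To make the construction concrete, here is a rough sketch that pairs up two numeric attributes as coordinates and places each pair on a coarse character grid. The function names, the grid size, and the sample values are assumptions made for illustration; the values are not those of Table 2.1.

```python
def scatter_points(xs, ys):
    """Treat each pair of values as an (x, y) coordinate pair."""
    if len(xs) != len(ys):
        raise ValueError("both attributes need the same number of values")
    return list(zip(xs, ys))


def render_scatter(points, cols=20, rows=8):
    """Crude text scatter plot: scale each point onto a rows-by-cols grid."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in points:
        c = round((x - x_min) / (x_max - x_min) * (cols - 1)) if x_max > x_min else 0
        r = round((y - y_min) / (y_max - y_min) * (rows - 1)) if y_max > y_min else 0
        grid[rows - 1 - r][c] = "*"  # row 0 is the top of the plot
    return "\n".join("".join(row) for row in grid)


# Hypothetical (unit price, items sold) values, invented for illustration.
unit_price = [40, 60, 80]
items_sold = [100, 300, 200]
points = scatter_points(unit_price, items_sold)
print(render_scatter(points))
```

A plotting library would normally be used instead of a character grid; the grid only shows that each observation becomes a single point in the plane.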
The scatter plot is a useful method for providing a first look at bivariate data to see
clusters of points and outliers, or to explore the possibility of correlation relationships.
Two attributes, X and Y, are correlated if one attribute implies the other. Correlations
can be positive, negative, or null (uncorrelated). Figure 2.8 shows examples of positive
and negative correlations between two attributes. If the plotted points pattern slopes