


          Chapter 2 Getting to Know Your Data



                         equal to $60, while 25% of items sold at branch 2 were less than or equal to $64. At the
                         50th percentile (marked by the median, which is also Q2), we see that 50% of items
                         sold at branch 1 were less than $78, while 50% of items at branch 2 were less than $85.
                         In general, we note that there is a shift in the distribution of branch 1 with respect to
                         branch 2 in that the unit prices of items sold at branch 1 tend to be lower than those at
                         branch 2.

                         Histograms

                         Histograms (or frequency histograms) are at least a century old and are widely used.
                         “Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of
                         poles. Plotting histograms is a graphical method for summarizing the distribution of a
                         given attribute, X. If X is nominal, such as automobile model or item type, then a pole
                         or vertical bar is drawn for each known value of X. The height of the bar indicates the
                         frequency (i.e., count) of that X value. The resulting graph is more commonly known as
                         a bar chart.
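
                         As a rough illustration of the nominal case, the frequency behind each bar is just a
                         count per distinct value. The sketch below uses Python's standard collections.Counter;
                         the item_types list is hypothetical data invented for this example, not taken from the text.

```python
from collections import Counter

# Hypothetical nominal attribute values (invented for illustration only).
item_types = ["laptop", "phone", "phone", "tablet", "laptop", "phone"]

# One bar per known value of X; the bar height is the count of that value.
frequency = Counter(item_types)

# Print a crude text bar chart, most frequent value first.
for value, count in frequency.most_common():
    print(f"{value:<8} {'#' * count}")
```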
                           If X is numeric, the term histogram is preferred. The range of values for X is parti-
                         tioned into disjoint consecutive subranges. The subranges, referred to as buckets or bins,
                         are disjoint subsets of the data distribution for X. The range of a bucket is known as
                         the width. Typically, the buckets are of equal width. For example, a price attribute with
                         a value range of $1 to $200 (rounded up to the nearest dollar) can be partitioned into
                         subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with a
                         height that represents the total count of items observed within the subrange. Histograms
                         and partitioning rules are further discussed in Chapter 3 on data reduction.
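
                         The equal-width partitioning described above can be sketched as follows. This is a
                         minimal illustration, assuming prices are rounded up to whole dollars and lie in the
                         $1 to $200 range from the example; the function name equal_width_histogram and the
                         sample prices are hypothetical.

```python
import math


def equal_width_histogram(prices, low=1, high=200, width=20):
    """Return [((bucket_low, bucket_high), count), ...] for equal-width buckets."""
    n_buckets = math.ceil((high - low + 1) / width)
    counts = [0] * n_buckets
    for price in prices:
        dollars = math.ceil(price)  # round up to the nearest dollar, as in the text
        if not low <= dollars <= high:
            raise ValueError(f"price {price} falls outside {low}..{high}")
        # Buckets are disjoint and consecutive: 1-20, 21-40, 41-60, ...
        counts[(dollars - low) // width] += 1
    buckets = [
        (low + i * width, min(low + (i + 1) * width - 1, high))
        for i in range(n_buckets)
    ]
    return list(zip(buckets, counts))


# Hypothetical prices, invented for illustration only.
sample_prices = [1, 15, 20, 21, 38.5, 60, 61, 199.01, 200]
histogram = equal_width_histogram(sample_prices)

# Each row is one bar: the subrange and a bar whose length is the count.
for (lo, hi), count in histogram:
    print(f"{lo:>3}-{hi:<3} {'#' * count}")
```

                         Boundary values show why the buckets must be disjoint: a $20 item falls in the
                         1 to 20 bucket and a $21 item in the 21 to 40 bucket, so no item is counted twice.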

           Example 2.15 Histogram. Figure 2.6 shows a histogram for the data set of Table 2.1, where buckets (or
                         bins) are defined by equal-width ranges representing $20 increments and the frequency
                         is the count of items sold.

                           Although histograms are widely used, they may not be as effective as the quantile
                         plot, q-q plot, and boxplot methods in comparing groups of univariate observations.

                         Scatter Plots and Data Correlation

                         A scatter plot is one of the most effective graphical methods for determining if there
                         appears to be a relationship, pattern, or trend between two numeric attributes. To con-
                         struct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic
                         sense and plotted as points in the plane. Figure 2.7 shows a scatter plot for the set of data
                         in Table 2.1.
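
                         The construction step can be sketched in a few lines: pair the i-th value of each
                         attribute into one coordinate, then place each pair in the plane. The helper names
                         scatter_points and render_ascii, and the data values, are hypothetical and chosen only
                         for illustration; a plotting library would normally replace the coarse character grid.

```python
def scatter_points(xs, ys):
    """Pair the i-th value of each attribute into one (x, y) coordinate."""
    if len(xs) != len(ys):
        raise ValueError("both attributes need one value per object")
    return list(zip(xs, ys))


def render_ascii(points, width=40, height=10):
    """Draw the points on a coarse character grid (smallest x and y at bottom left)."""
    x_vals = [x for x, _ in points]
    y_vals = [y for _, y in points]
    x_min, y_min = min(x_vals), min(y_vals)
    x_span = (max(x_vals) - x_min) or 1  # avoid dividing by zero for constant data
    y_span = (max(y_vals) - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top line
    return "\n".join("".join(line) for line in grid)


# Hypothetical values for two numeric attributes (invented for illustration).
unit_price = [10, 20, 30, 40, 50]
items_sold = [5, 9, 14, 22, 25]

points = scatter_points(unit_price, items_sold)
plot = render_ascii(points)
print(plot)
```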
                           The scatter plot is a useful method for providing a first look at bivariate data to see
                         clusters of points and outliers, or to explore the possibility of correlation relationships.
                         Two attributes, X and Y, are correlated if the values of one tend to change together with the values of the other. Correlations
                         can be positive, negative, or null (uncorrelated). Figure 2.8 shows examples of positive
                         and negative correlations between two attributes. If the plotted points pattern slopes