

          44    Chapter 2 Getting to Know Your Data



                         coordinates (e.g., when clustering houses), and monetary quantities (e.g., you are 100
                         times richer with $100 than with $1).

                   2.1.6 Discrete versus Continuous Attributes

                         In our presentation, we have organized attributes into nominal, binary, ordinal, and
                         numeric types. There are many ways to organize attribute types. The types are not
                         mutually exclusive.
                           Classification algorithms developed from the field of machine learning often talk of
                         attributes as being either discrete or continuous. Each type may be processed differently.
                         A discrete attribute has a finite or countably infinite set of values, which may or may not
                         be represented as integers. The attributes hair color, smoker, medical test, and drink size
                         each have a finite number of values, and so are discrete. Note that discrete attributes
                         may have numeric values, such as 0 and 1 for binary attributes or the values 0 to 110 for
                         the attribute age. An attribute is countably infinite if the set of possible values is infinite
                         but the values can be put in a one-to-one correspondence with the natural numbers. For
                         example, the attribute customer ID is countably infinite. The number of customers can
                         grow without bound, but the set of values remains countable (that is, the values can
                         be put in one-to-one correspondence with the natural numbers). Zip codes are another
                         example.
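
As a small illustration of the distinction above, the following Python sketch contrasts attributes with a finite set of values against a customer ID attribute whose values can be enumerated one by one. The value sets and the ID format (C1, C2, ...) are hypothetical and chosen only for the example.

```python
from itertools import count, islice

# Finite domains: each attribute can take only a fixed set of values.
# The value sets below are hypothetical examples.
HAIR_COLORS = {"black", "brown", "blond", "red", "gray"}
SMOKER = {0, 1}  # binary attribute coded as integers


def customer_ids():
    """Yield customer IDs C1, C2, C3, ... without end (hypothetical format)."""
    for n in count(1):   # n runs over the natural numbers 1, 2, 3, ...
        yield f"C{n}"    # each n corresponds to exactly one ID


# The generator never ends, so take only the first five IDs.
first_five = list(islice(customer_ids(), 5))
print(first_five)                     # ['C1', 'C2', 'C3', 'C4', 'C5']
print(len(HAIR_COLORS), len(SMOKER))  # 5 2
```

The generator makes the one-to-one correspondence concrete: every natural number yields exactly one ID, so the set of possible IDs is infinite yet still countable.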
                           If an attribute is not discrete, it is continuous. The terms numeric attribute and con-
                         tinuous attribute are often used interchangeably in the literature. (This can be confusing
                         because, in the classic sense, continuous values are real numbers, whereas numeric val-
                         ues can be either integers or real numbers.) In practice, real values are represented
                         using a finite number of digits. Continuous attributes are typically represented as
                         floating-point variables.
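
The effect of storing real values with a finite number of digits can be seen in a short Python sketch. It assumes the usual case in which a Python float is an IEEE 754 double-precision number; the specific values 0.1, 0.2, and 0.3 are chosen only for illustration.

```python
import math
import sys

# Precision of the platform's float type.
print(sys.float_info.mant_dig)  # number of bits in the significand
print(sys.float_info.dig)       # decimal digits that are reliably preserved

# 0.1, 0.2, and 0.3 have no exact binary representation, so the stored
# sum differs slightly from the stored value of 0.3.
total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True: equal within a relative tolerance
```

For this reason, continuous attribute values are usually compared within a tolerance rather than tested for exact equality.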


                 2.2     Basic Statistical Descriptions of Data


                         For data preprocessing to be successful, it is essential to have an overall picture of your
                         data. Basic statistical descriptions can be used to identify properties of the data and
                         highlight which data values should be treated as noise or outliers.
                           This section discusses three areas of basic statistical descriptions. We start with mea-
                         sures of central tendency (Section 2.2.1), which measure the location of the middle or
                         center of a data distribution. Intuitively speaking, given an attribute, where do most of
                         its values fall? In particular, we discuss the mean, median, mode, and midrange.
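
The four measures named above can be sketched with Python's standard statistics module. The data values are made up for illustration, and the midrange is taken here as the average of the smallest and largest values.

```python
import statistics

# Hypothetical values of a single numeric attribute.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)           # arithmetic average
median = statistics.median(values)       # middle value of the sorted data
modes = statistics.multimode(values)     # most frequent value(s), as a list
midrange = (min(values) + max(values)) / 2

print(mean)      # 6
print(median)    # 5
print(modes)     # [4]
print(midrange)  # 6.5
```

Note that statistics.multimode (available from Python 3.8) returns a list, because a data set may have more than one mode.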
                           In addition to assessing the central tendency of our data set, we also would like to
                         have an idea of the dispersion of the data. That is, how are the data spread out? The most
                         common data dispersion measures are the range, quartiles, and interquartile range; the
                         five-number summary and boxplots; and the variance and standard deviation of the data.
                         These measures are useful for identifying outliers and are described in Section 2.2.2.
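
A minimal Python sketch of these dispersion measures follows, again on made-up values. Two assumptions should be noted: quartile conventions differ between textbooks and software, and this sketch simply uses the default method of statistics.quantiles; the variance and standard deviation are computed in their population form (dividing by N), whereas the sample form divides by N - 1.

```python
import statistics

# Hypothetical values of a single numeric attribute.
values = [2, 4, 4, 5, 7, 9, 11]

# Range: difference between the largest and smallest values.
value_range = max(values) - min(values)

# Quartiles: the three cut points that split the sorted data into four parts.
# Other quartile conventions may give slightly different results.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(values), q1, q2, q3, max(values))

# Population variance and standard deviation (divide by N).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

print(value_range)
print(five_number_summary)
print(iqr)
print(variance, std_dev)
```

To use the sample form instead, statistics.variance and statistics.stdev can be substituted for pvariance and pstdev.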
                           Finally, we can use many graphic displays of basic statistical descriptions to visually
                         inspect our data (Section 2.2.3). Most statistical or graphical data presentation software