
          Chapter 2 Getting to Know Your Data



                         two attributes are mapped onto a 2-D grid) to more sophisticated methods such as
                         tree-maps (where a hierarchical partitioning of the screen is displayed based on the attribute
                         values). Data visualization techniques are described in Section 2.3.
                           Finally, we may want to examine how similar (or dissimilar) data objects are. For
                         example, suppose we have a database where the data objects are patients, described by
                         their symptoms. We may want to find the similarity or dissimilarity between individual
                         patients. Such information can allow us to find clusters of like patients within the
                         data set. The similarity/dissimilarity between objects may also be used to detect outliers
                         in the data, or to perform nearest-neighbor classification. (Clustering is the topic
                         of Chapters 10 and 11, while nearest-neighbor classification is discussed in Chapter 9.)
                         There are many measures for assessing similarity and dissimilarity. In general, such measures
                         are referred to as proximity measures. Think of the proximity of two objects as a
                         function of the distance between their attribute values, although proximity can also be
                         calculated based on probabilities rather than actual distance. Measures of data proximity
                         are described in Section 2.4.
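
As a rough illustration of proximity as a function of the distance between attribute values, the sketch below compares hypothetical patients described by numeric symptom attributes. The choice of Euclidean distance, the 1 / (1 + d) conversion to a similarity score, and the patient values are all assumptions made for this example; they are only one of many possible proximity measures.

```python
import math

def euclidean_distance(a, b):
    """Distance between two equal-length numeric attribute vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed conversion)."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Hypothetical patients: (temperature in deg C, cough severity 0-3, fatigue 0-3)
patient_a = (38.5, 2, 3)
patient_b = (38.0, 2, 2)
patient_c = (36.6, 0, 0)

# Patients a and b have closer attribute values than a and c,
# so a and b receive the higher similarity score.
print(similarity(patient_a, patient_b) > similarity(patient_a, patient_c))
```

In practice the attributes would usually be rescaled first, since an attribute with a large numeric range would otherwise dominate the distance.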
                           In summary, by the end of this chapter, you will know the different attribute types
                         and basic statistical measures to describe the central tendency and dispersion (spread)
                         of attribute data. You will also know techniques to visualize attribute distributions and
                         how to compute the similarity or dissimilarity between objects.

                 2.1     Data Objects and Attribute Types


                         Data sets are made up of data objects. A data object represents an entity—in a sales
                         database, the objects may be customers, store items, and sales; in a medical database, the
                         objects may be patients; in a university database, the objects may be students, professors,
                         and courses. Data objects are typically described by attributes. Data objects can also be
                         referred to as samples, examples, instances, data points, or objects. If the data objects are
                         stored in a database, they are data tuples. That is, the rows of a database correspond to
                         the data objects, and the columns correspond to the attributes. In this section, we define
                         attributes and look at the various attribute types.
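
The row/column correspondence described above can be sketched in a few lines of Python. The `Student` type, its three attributes, and the sample values are hypothetical and chosen only for illustration; a real university database would define its own schema.

```python
from typing import NamedTuple

class Student(NamedTuple):
    # Each field is an attribute (a column of the table).
    student_id: int
    name: str
    major: str

# Each tuple is one data object (a row of the table).
students = [
    Student(1, "Ana Lopez", "Biology"),
    Student(2, "Ben Wu", "Statistics"),
    Student(3, "Cara Singh", "Biology"),
]

# The column names are the attributes.
attribute_names = Student._fields

# Reading down one column yields that attribute's value for every object.
majors_column = [s.major for s in students]
```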

                   2.1.1 What Is an Attribute?

                         An attribute is a data field, representing a characteristic or feature of a data object. The
                         nouns attribute, dimension, feature, and variable are often used interchangeably in the
                         literature. The term dimension is commonly used in data warehousing. Machine learning
                         literature tends to use the term feature, while statisticians prefer the term variable. Data
                         mining and database professionals commonly use the term attribute, and we do here
                         as well. Attributes describing a customer object can include, for example, customer ID,
                         name, and address. Observed values for a given attribute are known as observations. A set
                         of attributes used to describe a given object is called an attribute vector (or feature vector).
                         The distribution of data involving one attribute (or variable) is called univariate.
                         A bivariate distribution involves two attributes, and so on.
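
To make these terms concrete, the following sketch builds attribute vectors for a few hypothetical customers and tabulates a univariate and a bivariate frequency distribution. The attributes `age` and `income_band` and all values are assumptions made for the example, and simple frequency counts are only one way to summarize a distribution.

```python
from collections import Counter

# Hypothetical customer objects; every value shown is an observation.
customers = [
    {"customer_id": 1, "age": 34, "income_band": "medium"},
    {"customer_id": 2, "age": 45, "income_band": "high"},
    {"customer_id": 3, "age": 34, "income_band": "high"},
    {"customer_id": 4, "age": 29, "income_band": "low"},
]

# Attribute vector: the set of attributes used to describe one object.
attribute_vectors = [(c["age"], c["income_band"]) for c in customers]

# Univariate distribution: frequencies of the values of a single attribute.
age_counts = Counter(c["age"] for c in customers)

# Bivariate distribution: joint frequencies of two attributes.
joint_counts = Counter(attribute_vectors)
```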