

                               (c) Numeric attributes
                              (d) Term-frequency vectors

                           2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
                               (a) Compute the Euclidean distance between the two objects.
                              (b) Compute the Manhattan distance between the two objects.
                               (c) Compute the Minkowski distance between the two objects, using q = 3.
                              (d) Compute the supremum distance between the two objects.
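The four distances in Exercise 2.6 are all members of the Minkowski family, so one small helper covers parts (a) to (c), and the supremum distance is the largest per-attribute difference. The following is a minimal sketch; the function names `minkowski` and `supremum` are chosen here for illustration and do not come from the text.

```python
# Sketch for Exercise 2.6: Minkowski-family distances between two tuples.
# Function names are illustrative assumptions, not from the text.

def minkowski(x, y, q):
    """Minkowski distance of order q (q=1: Manhattan, q=2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
minkowski3 = minkowski(x, y, 3)
sup = supremum(x, y)

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
print("Minkowski (q = 3):", minkowski3)
print("Supremum:", sup)
```

The per-attribute absolute differences are (2, 1, 6, 2), so the Manhattan distance is 11, the supremum distance is 6, and the Euclidean distance is the square root of 45.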
                           2.7 The median is one of the most important holistic measures in data analysis. Propose
                               several methods for median approximation. Analyze their respective complexity
                               under different parameter settings and decide to what extent the real value can be
                               approximated. Moreover, suggest a heuristic strategy to balance between accuracy and
                               complexity and then apply it to all methods you have given.
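One possible approach to Exercise 2.7, shown here only as a sketch, is to bucket the values into equal-width intervals in a single pass and interpolate linearly inside the interval that contains the middle value. The function name `approx_median` and the parameters `low`, `width`, and `n_bins` are assumptions made for this example. The sketch also assumes that all values lie in the range covered by the bins; values outside it are clamped to the first or last bin.

```python
# Sketch: approximate the median from equal-width bin counts.
# approx_median, low, width, and n_bins are illustrative names (assumptions).

def approx_median(values, low, width, n_bins):
    """Estimate the median by linear interpolation inside the median bin."""
    if not values:
        raise ValueError("values must not be empty")

    # One pass over the data: count how many values fall in each bin.
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) / width)
        i = max(0, min(i, n_bins - 1))  # clamp out-of-range values
        counts[i] += 1

    # Walk the bins until the cumulative count reaches N / 2.
    half = len(values) / 2
    cumulative = 0
    for i, count in enumerate(counts):
        if cumulative + count >= half:
            lower_boundary = low + i * width
            return lower_boundary + (half - cumulative) / count * width
        cumulative += count

estimate = approx_median([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], low=0, width=5, n_bins=3)
print(estimate)
```

The cost is one pass over the N values plus one pass over the bins, and the error is bounded by the bin width, so `width` is the parameter that trades accuracy against memory in this sketch.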
                           2.8 It is important to define or select similarity measures in data analysis. However, there
                               is no commonly accepted subjective similarity measure. Results can vary depending on
                               the similarity measures used. Nonetheless, seemingly different similarity measures may
                               be equivalent after some transformation.
                                 Suppose we have the following 2-D data set:

                                                                 A1    A2
                                                            x1   1.5   1.7
                                                            x2   2     1.9
                                                            x3   1.6   1.8
                                                            x4   1.2   1.5
                                                            x5   1.5   1.0
                               (a) Consider the data as 2-D data points. Given a new data point, x = (1.4,1.6) as a
                                  query, rank the database points based on similarity with the query using Euclidean
                                  distance, Manhattan distance, supremum distance, and cosine similarity.
                              (b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
                                  distance on the transformed data to rank the data points.
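The rankings asked for in Exercise 2.8 can be sketched as follows. The helper names (`rank`, `normalize`, and so on) are chosen for this example. Part (b) is read here as normalizing the query along with the data points, which is an assumption, since the exercise only mentions the data set.

```python
# Sketch for Exercise 2.8: rank five 2-D points against a query point.
# Helper names are illustrative; normalizing the query in (b) is an assumption.
import math

data = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def supremum(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def cosine(a, b):
    return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))

def normalize(v):
    """Scale v to unit Euclidean norm."""
    n = norm(v)
    return tuple(c / n for c in v)

def rank(score, reverse=False):
    """Sort point names by score; reverse=True puts the largest score first."""
    return sorted(data, key=lambda name: score(data[name]), reverse=reverse)

# (a) Distances rank ascending; cosine similarity ranks descending.
rank_euclidean = rank(lambda p: euclidean(p, query))
rank_manhattan = rank(lambda p: manhattan(p, query))
rank_supremum = rank(lambda p: supremum(p, query))
rank_cosine = rank(lambda p: cosine(p, query), reverse=True)

# (b) Euclidean distance after scaling every point (and the query) to norm 1.
unit_query = normalize(query)
rank_normalized = rank(lambda p: euclidean(normalize(p), unit_query))

for label, ranking in [
    ("Euclidean", rank_euclidean),
    ("Manhattan", rank_manhattan),
    ("Supremum", rank_supremum),
    ("Cosine", rank_cosine),
    ("Euclidean on normalized data", rank_normalized),
]:
    print(label, ranking)
```

For unit-length vectors a and b, the squared Euclidean distance equals 2 - 2 cos(a, b), so the ranking in part (b) matches the cosine ranking from part (a). Under supremum distance some points are tied (x3 with x4, and x2 with x5), so their relative order in that ranking depends on floating-point rounding.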



                       2.7     Bibliographic Notes


                               Methods for descriptive data summarization have been studied in the statistics literature
                               long before the onset of computers. Good summaries of statistical descriptive data mining
                               methods include Freedman, Pisani, and Purves [FPP07] and Devore [Dev95]. For