(c) Numeric attributes
(d) Term-frequency vectors
2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
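The four distances in Exercise 2.6 can all be read as instances of the Minkowski distance: Manhattan corresponds to q = 1, Euclidean to q = 2, and the supremum distance to the limit as q grows without bound. A minimal Python sketch under that reading follows; the function names `minkowski` and `supremum` are illustrative and do not come from the text.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length tuples."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest coordinate-wise gap."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

manhattan_xy = minkowski(x, y, 1)   # (a)-(b): q = 1 gives Manhattan
euclidean_xy = minkowski(x, y, 2)   # q = 2 gives Euclidean
minkowski3_xy = minkowski(x, y, 3)  # (c): q = 3
supremum_xy = supremum(x, y)        # (d)
```

The coordinate-wise absolute differences are 2, 1, 6, and 2, so each result can be checked by hand against the sums 11, 45, and 233 and the maximum 6.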
2.7 The median is one of the most important holistic measures in data analysis. Propose
several methods for median approximation. Analyze their respective complexity
under different parameter settings and decide to what extent the real value can be
approximated. Moreover, suggest a heuristic strategy to balance between accuracy and
complexity and then apply it to all methods you have given.
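One possible approach to Exercise 2.7, offered as a sketch rather than as the intended answer, is to bin the data into equal-width intervals and interpolate linearly inside the interval that contains the middle of the sorted data. The function name `approx_median` and the parameter `num_bins` are assumptions made for illustration.

```python
def approx_median(values, num_bins):
    """Approximate the median by equal-width binning and linear interpolation.

    Assumes `values` is a non-empty sequence of numbers and num_bins >= 1.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return float(lo)  # all values identical

    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1

    half = len(values) / 2
    below = 0  # number of values in bins before the current one
    for i, c in enumerate(counts):
        if below + c >= half:
            # Interpolate inside the bin that holds the middle position.
            return lo + i * width + (half - below) / c * width
        below += c


estimate = approx_median(list(range(1, 101)), num_bins=10)
```

This version needs only a few linear passes over the data plus one scan of the bins, and it stores `num_bins` counters instead of sorting. The estimate lands in the same interval as the middle value of the sorted data, so a larger `num_bins` generally tightens the approximation at the cost of more memory.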
2.8 It is important to define or select similarity measures in data analysis. However, there
is no commonly accepted subjective similarity measure. Results can vary depending on
the similarity measures used. Nonetheless, seemingly different similarity measures may
be equivalent after some transformation.
Suppose we have the following 2-D data set:
       A1     A2
x1     1.5    1.7
x2     2      1.9
x3     1.6    1.8
x4     1.2    1.5
x5     1.5    1.0
(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6), as a
query, rank the database points based on similarity with the query using Euclidean
distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
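The rankings asked for in Exercise 2.8 can be sketched in Python as follows. The dictionary `data`, the helper names, and the `rank` function are illustrative assumptions; only the coordinates and the query point come from the exercise.

```python
import math

data = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


def normalize(x):
    """Scale a 2-D point so that its Euclidean norm equals 1."""
    norm = math.hypot(*x)
    return tuple(a / norm for a in x)


def rank(score, reverse=False):
    """Order the point names by score against the query, most similar first."""
    return sorted(data, key=lambda name: score(data[name], query), reverse=reverse)


# Part (a): smaller distance means more similar; larger cosine means more similar.
by_euclidean = rank(euclidean)
by_manhattan = rank(manhattan)
by_supremum = rank(supremum)
by_cosine = rank(cosine_similarity, reverse=True)

# Part (b): Euclidean distance after scaling every point, and the query, to unit norm.
unit_query = normalize(query)
by_unit_euclidean = sorted(
    data, key=lambda name: euclidean(normalize(data[name]), unit_query)
)
```

For unit vectors u and v, the squared Euclidean distance equals 2 - 2 cos(u, v), so the ranking in part (b) coincides with the cosine ranking in part (a). Under the supremum distance some points are tied, and the order among tied points here depends on floating-point rounding.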
2.7 Bibliographic Notes
Methods for descriptive data summarization have been studied in the statistics literature
long before the onset of computers. Good summaries of statistical descriptive data mining
methods include Freedman, Pisani, and Purves [FPP07] and Devore [Dev95]. For