3.2 The Standardization Issue
First of all, the need for any standardization must be questioned. If the
interesting clusters are based on the original features, then any standardization
method may distort or mask those clusters. It is only when there are grounds to
search for clusters in a transformed space that some standardization rule should be
used.
Several simple standardization methods have been proposed for achieving scale
invariance or at least attempting a balanced contribution of all features to distance
measurements:
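$$y_i = \frac{x_i - m}{s}, \quad \text{with } m, s \text{ respectively the mean and standard deviation of } x_i; \tag{3-2a}$$
$$y_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}; \tag{3-2b}$$
$$y_i = \frac{x_i}{\max(x_i) - \min(x_i)}; \tag{3-2c}$$
$$y_i = \frac{x_i}{a}. \tag{3-2d}$$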
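As an illustration, the four rules (3-2a) to (3-2d) can be sketched for a single feature as follows. This is only a minimal sketch: the function names are hypothetical, the feature is assumed to be a plain list of numbers that is not constant, and the population standard deviation is assumed for s, since the text does not say which estimate is meant.

```python
from statistics import mean, pstdev


def standardize_zscore(x):
    """Rule (3-2a): zero mean, unit variance (population s is an assumption)."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]


def standardize_minmax(x):
    """Rule (3-2b): shift and scale the feature onto the interval [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


def standardize_range(x):
    """Rule (3-2c): divide by the range, without shifting the origin."""
    lo, hi = min(x), max(x)
    return [v / (hi - lo) for v in x]


def standardize_scale(x, a):
    """Rule (3-2d): divide by a user-chosen scale factor a."""
    return [v / a for v in x]


# Hypothetical feature values, chosen only for illustration.
x = [2.0, 4.0, 6.0, 8.0]
for rule in (standardize_zscore, standardize_minmax, standardize_range):
    print(rule.__name__, rule(x))
print("standardize_scale", standardize_scale(x, 10.0))
```

Note that only (3-2c) and (3-2d) keep the origin of the feature in place; (3-2a) and (3-2b) also shift it.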
There is also, of course, the more sophisticated orthonormal transformation,
described in section 2.3, preserving the Mahalanobis distance. All these
standardization methods have some pitfalls. Consider, for instance, the popular
standardization method of obtaining scale invariance by using transformed features
with zero mean and unit variance (3-2a). An evident pitfall is that semantic
information carried by the features can be lost with this standardization. Another
problem is that unit variance standardization is only adequate if the differences in
feature variances are due only to random variation. If, however, such differences are
due to the partition of the data into distinct clusters, it may produce totally wrong results.
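The second pitfall can be made concrete with a small constructed example. The data below are hypothetical: two groups of points that differ only in the first feature, so that the large variance of that feature is entirely due to the cluster structure. The sketch compares, before and after rule (3-2a), the smallest distance between the two groups with the largest distance inside a group; the helper names and the use of the population standard deviation are assumptions made for the illustration.

```python
from itertools import combinations, product
from math import dist
from statistics import mean, pstdev

# Hypothetical data: two clusters separated along the first feature only.
cluster_a = [(0.0, -1.0), (0.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
cluster_b = [(10.0, -1.0), (10.0, 1.0), (11.0, -1.0), (11.0, 1.0)]


def zscore_columns(points):
    """Apply rule (3-2a) to every feature (column) of a list of points."""
    columns = list(zip(*points))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(p, stats)) for p in points]


def separation_ratio(a, b):
    """Smallest between-cluster distance over largest within-cluster distance."""
    gap = min(dist(p, q) for p, q in product(a, b))
    diameter = max(dist(p, q) for group in (a, b)
                   for p, q in combinations(group, 2))
    return gap / diameter


raw_ratio = separation_ratio(cluster_a, cluster_b)

z = zscore_columns(cluster_a + cluster_b)
std_ratio = separation_ratio(z[:4], z[4:])

print(f"separation ratio, original features:     {raw_ratio:.2f}")
print(f"separation ratio, standardized features: {std_ratio:.2f}")
```

In the original space the gap between the clusters is about four times the cluster diameter. After standardization the ratio falls below one, because the first feature has been compressed precisely by the spread that the clusters themselves produce.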
If we know beforehand the type of clusters we are dealing with, we can devise a
suitable standardization method. This poses the following vicious circle:
1. In order to perform clustering we need an appropriate distance measure.
2. The appropriate distance measure depends on the feature standardization
method.
3. In order to select the standardization method we need to know the type of
clusters needed.
There is no methodological way out of this vicious circle except by a "trial and
error" approach, experimenting with various alternatives and evaluating the
corresponding solutions aided by visual inspection, data interpretation and utility
considerations. An easy standardization method that we will often follow, and that
frequently achieves good results, is division or multiplication by a simple scale
factor (e.g. a power of 10), properly chosen so that all feature values occupy a
suitable interval. This corresponds to method (3-2d). In this way we can balance
the contribution of the features and still retain semantic information.
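One possible way of choosing such a scale factor is sketched below. The selection rule is an assumption made for illustration, not a prescription from the text: for each feature it picks the smallest power of 10 that is not exceeded by the largest absolute value, so that the scaled feature falls within [-1, 1]. The function name and the sample data are hypothetical.

```python
def power_of_ten_scale(x):
    """Return the smallest power of 10 that is >= max |x| (1.0 if x is all zeros)."""
    peak = max(abs(v) for v in x)
    if peak == 0:
        return 1.0
    exponent = 0
    while peak > 10 ** exponent:         # grow until the peak fits
        exponent += 1
    while peak <= 10 ** (exponent - 1):  # shrink while a smaller power still fits
        exponent -= 1
    return 10.0 ** exponent


# Hypothetical features measured in very different units.
weights_g = [52000.0, 68000.0, 81000.0]
heights_m = [1.62, 1.75, 1.88]

for name, feature in (("weight", weights_g), ("height", heights_m)):
    a = power_of_ten_scale(feature)
    scaled = [v / a for v in feature]    # rule (3-2d)
    print(name, "a =", a, "scaled =", scaled)
```

Since each feature is only divided by a constant, zero stays at zero and ratios between values are unchanged, which is the sense in which semantic information is retained.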