
            indicating that some overtraining occurs and/or that some feature selection
            and feature scaling might be required. The next step, therefore, is to rescale
            the data to have zero mean and unit variance in all feature directions. This
            can be performed using the PRTools function scalem([],'variance'):

            Listing 9.2

            % Load the housing data set
            load housing.mat;
            % Define the variance-scaling mapping and combine it with
            % an untrained linear classifier
            w_sc = scalem([],'variance');
            w = w_sc*ldc;
            % Perform 5-fold cross-validation
            err_ldc_sc = crossval(z,w,5)
            % Do the same for some other classifiers
            err_qdc_sc = crossval(z,w_sc*qdc,5)
            err_knnc_sc = crossval(z,w_sc*knnc,5)
            err_parzenc_sc = crossval(z,w_sc*parzenc,5)

            First note that when we introduce a preprocessing step, this step should
            be defined inside the mapping w. The obvious approach, mapping the
            whole data set with z_sc = z*scalem(z,'variance') and then applying
            the cross-validation to estimate the classifier performance, is
            incorrect. In that case, some of the test data has already been used
            to fit the scaling, resulting in an overtrained classifier, and thus
            in an optimistically biased estimate of the error. To avoid this, the
            mapping should be extended from ldc to w_sc*ldc. The routine
            crossval then takes care of fitting both the scaling and the
            classifier on the training folds only.
              Scaling the features should not change the performance of the first
            two classifiers, ldc and qdc. These normal density based classifiers
            are insensitive to the scaling of individual features, because they
            already estimate the feature variances themselves. The performance of
            knnc and parzenc, on the other hand, improves significantly, to
            14.1% and 13.1%, respectively (with a standard deviation of about
            0.4%). Although parzenc now approaches the performance of the linear
            classifier, it is still slightly worse. Perhaps feature extraction or
            feature selection will improve the results.
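              This scale invariance is easy to verify. A minimal sketch (reusing
            z from Listing 9.2) cross-validates ldc with and without the scaling
            step; the two error estimates should differ only by the random
            assignment of the folds:

            % ldc on the raw features
            err_raw = crossval(z,ldc,5)
            % ldc with variance scaling fitted inside each training fold
            err_sc = crossval(z,scalem([],'variance')*ldc,5)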
              As discussed in Chapter 7, principal component analysis (PCA) is
            one of the most frequently used feature extraction methods. It
            concentrates on the high-variance directions in the data and removes
            the low-variance directions.
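              In PRTools, PCA can be chained into the pipeline in the same way as
            the scaling step. The sketch below assumes the pca mapping (renamed
            pcam in later PRTools releases), where a fractional argument retains
            the components that explain that fraction of the total variance:

            % PCA on the raw features, keeping 90% of the variance; the
            % result is dominated by large-scale features such as TAX
            err_pca = crossval(z,pca([],0.9)*ldc,5)
            % PCA after variance scaling, both fitted inside each training fold
            err_sc_pca = crossval(z,scalem([],'variance')*pca([],0.9)*ldc,5)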
            In this data set we have seen that the feature values have very
            different scales. Applying PCA directly to these data will place
            heavy emphasis on the feature TAX and will probably ignore the
            feature NOX. Indeed, when