Page 37 - Biosystems Engineering


               singular value decomposition surpass the commonly used row-average
               method and the naive approach of filling missing values with zeros.
               Kim et al. (2005) proposed the local least squares imputation method,
               which represents a missing value of a gene as a linear combination of
               similar genes.
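As a minimal sketch of the baseline these methods are compared against, row-average imputation can be written as follows (the small expression matrix is hypothetical, chosen only for illustration):

```python
import numpy as np

def row_average_impute(X):
    """Fill each missing entry with the mean of the observed values in
    the same row (gene). This is the simple row-average baseline
    mentioned in the text, which the SVD-based and local least squares
    methods are reported to surpass."""
    X = np.array(X, dtype=float)            # work on a copy
    for i in range(X.shape[0]):
        row = X[i]
        missing = np.isnan(row)
        if missing.any():
            row[missing] = np.nanmean(row)  # mean over observed entries only
    return X

# Hypothetical 2-gene x 3-sample expression matrix with missing values.
expr = [[1.0, np.nan, 3.0],
        [4.0, 4.0, np.nan]]
filled = row_average_impute(expr)   # [[1, 2, 3], [4, 4, 4]]
```

The more sophisticated methods above refine this idea by borrowing information from similar genes rather than from a single row mean.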
                   An outlier is a data pattern that deviates substantially from the
               rest of the data distribution. Outliers can severely degrade accuracy.
               The problem can be addressed by removing outliers through statistical
               screening techniques such as multidimensional scaling, or by using
               robust classification methods that are not unduly influenced by outliers.
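A simple univariate version of this statistical screening (with hypothetical readings; the text's example, multidimensional scaling, instead works on distances between whole samples) might look like:

```python
import numpy as np

def remove_outliers(values, z_thresh=2.0):
    """Keep only values whose z-score magnitude is below z_thresh.

    A univariate stand-in for the statistical outlier screening
    described in the text; multivariate methods generalize the same
    idea from single measurements to whole samples.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) < z_thresh]

# Hypothetical measurements: five consistent readings and one outlier.
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])
clean = remove_outliers(data)       # the 25.0 reading is dropped
```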
                   Transforming data into a form acceptable to an analysis method
               may be necessary. For example, for a statistical analysis (e.g., a para-
               metric t-test) that requires normally distributed data, applying a
               logarithmic transformation improves the approximation to a normal
               distribution.
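To illustrate with hypothetical right-skewed intensities: a log2 transform, a common choice for expression data, pulls a long right tail toward symmetry, which a simple moment-based skewness estimate makes visible:

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness: near 0 for symmetric data,
    large and positive for a long right tail."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# Hypothetical raw intensities with a long right tail.
raw = np.array([50, 120, 200, 450, 900, 1800, 3600, 7200], dtype=float)
logged = np.log2(raw)

# The raw data are strongly right-skewed; after the log transform the
# distribution is nearly symmetric, hence closer to normal.
```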
                   Dimensionality reduction is essential in exploratory data analy-
               sis, where the purpose is to map data onto a low-dimensional space
               for improved visualization. It also reduces the complexity of a prob-
               lem and makes it easier to perform high-level data analysis such as
               building a classifier. Dimensionality reduction can be accomplished
               through feature extraction and selection where an optimum subset of
               features derived from the input variables (e.g., genes) is selected.
               Thus, feature selection methods keep only the useful features and dis-
               card the rest. Note that feature extraction is distinct from feature
               (variable) selection: the former constructs new features out of the
               original variables, whereas the latter chooses the most relevant
               subset of the original variables themselves. One well-known
               linear transformation used to reduce model dimensionality is princi-
               pal component analysis (PCA). PCA transforms the input variables to
               a new set of variables (features). The new variables (also known as
               principal components) are computed as a linear combination of the
               original variables and are orthogonal to each other. PCA reduces
               input dimensionality by providing a subset of the principal compo-
               nents that captures most of the information in the original data. A
               classifier that uses the selected principal components as inputs may
               achieve better accuracy than a classifier trained on the full, high-
               dimensional set of original variables, many of which may be
               coregulated genes.
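The following sketch implements PCA via the singular value decomposition of the centered data matrix; the expression-like data are simulated (five "genes" driven by one shared latent signal), so they are an assumption for illustration only:

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x variables) onto its top principal components.

    The components are orthogonal linear combinations of the original
    variables, obtained from the SVD of the centered data; var_kept is
    the fraction of total variance the selected components capture.
    """
    Xc = X - X.mean(axis=0)                        # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T              # new feature values
    var_kept = float((S[:n_components] ** 2).sum() / (S ** 2).sum())
    return scores, var_kept

# Simulated data: 30 samples of 5 coregulated "genes" that all track
# one latent signal plus a little noise, so a couple of components
# capture nearly all the variance.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 1))
X = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(30, 5))

scores, var_kept = pca(X, n_components=2)
```

Feeding `scores` rather than `X` to a classifier is the dimensionality-reduction step described above: the inputs shrink from five correlated variables to two orthogonal features with little loss of information.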



          1.5 High-Level Analysis

               1.5.1 Clustering
               Clustering is a useful exploratory technique for analyzing large-volume,
               high-dimensional data when there is no a priori information about
               shared properties. Clustering algorithms help discover common
               properties in the data and group objects according to those
               properties. Objects may be clustered by sample or by gene. The
               former helps researchers identify which of