Page 37 - Biosystems Engineering



singular value decomposition surpass the commonly used row average
method and filling missing values with zeros. Kim et al. (2005)
proposed the local least squares imputation method, which represents a
missing value of a gene as a linear combination of similar genes.
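
As a rough illustration of the imputation strategies mentioned above, the following sketch fills missing entries in a small genes-by-samples matrix in two ways: by row average, and by a simplified scheme in the spirit of local least squares imputation. It is not the published algorithm of Kim et al. (2005). Encoding missing values as NaN, the function names, the parameter k, the use of Euclidean distance to choose similar genes, and the toy data are all assumptions made for the example.

```python
import numpy as np

def impute_row_average(X):
    """Replace each NaN with the mean of the observed values in its row (gene)."""
    X = np.array(X, dtype=float)
    row_means = np.nanmean(X, axis=1)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = row_means[rows]
    return X

def impute_local_least_squares(X, k=2):
    """Simplified LLS-style imputation (illustrative, not the published method).

    For each gene with missing entries, pick the k complete genes closest to it
    (Euclidean distance over its observed samples), express its observed values
    as a least-squares combination of those genes, and reuse the same weights
    to predict the missing samples.
    """
    X = np.array(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]   # candidate neighbor genes
    out = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        dist = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        nearest = complete[np.argsort(dist)[:k]]
        # Solve nearest[:, obs].T @ w ~= X[i, obs] for the weights w.
        w, *_ = np.linalg.lstsq(nearest[:, obs].T, X[i, obs], rcond=None)
        out[i, miss] = nearest[:, miss].T @ w
    return out

# Toy data: the last gene is three times the first, with one value missing.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [5.0, 1.0, 0.0, 2.0],
    [3.0, 6.0, np.nan, 12.0],
])
avg_filled = impute_row_average(expr)
lls_filled = impute_local_least_squares(expr, k=2)
```

On this toy matrix the row average ignores the gene's pattern across samples, whereas the neighbor-based estimate follows it, which is the intuition behind using similar genes.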
An outlier is a data pattern that deviates substantially from the data
distribution. Outliers can have severe effects on the accuracy of an
analysis. The problem can be addressed by removing outliers through
statistical techniques such as multidimensional scaling or by using
robust classification methods that are not influenced by outliers.
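
To make the idea of a robust statistic concrete, the sketch below flags outliers in a single list of measurements using the median and the median absolute deviation (MAD). This is a swapped-in univariate technique chosen for brevity; it is neither multidimensional scaling nor a robust classifier as named in the text. The function name, the cutoff of 3.5, and the sample values are assumptions for illustration.

```python
import statistics

def flag_outliers_mad(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are far
    less sensitive to extreme values than the mean and standard deviation.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)   # no spread: nothing can be flagged
    # 0.6745 scales the MAD to be comparable to a standard deviation
    # under a normal distribution.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Hypothetical intensities; the last value is far from the rest.
intensities = [10.1, 9.8, 10.3, 9.9, 10.0, 25.0]
flags = flag_outliers_mad(intensities)
```

Here only the last value is flagged. Because the median and MAD barely move when one extreme value is present, the cutoff is not distorted by the very outlier it is meant to detect.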
Transforming data into a form acceptable to an analysis method
may be necessary. For example, for a statistical analysis (e.g., a
parametric t-test) that requires the data to be normally distributed,
applying a logarithmic transformation to the data can improve the
approximation to a normal distribution.
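
The effect of a logarithmic transformation can be seen on a small, deliberately right-skewed sample. The sketch below compares a moment-based skewness before and after a base-2 log transform; the helper name, the choice of base 2, and the made-up values are assumptions, and real data would need all values to be strictly positive before taking logs.

```python
import math

def skewness(values):
    """Moment-based skewness; 0 for a perfectly symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up intensities that double at each step (strongly right-skewed).
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
logged = [math.log2(v) for v in raw]   # requires strictly positive values

raw_skew = skewness(raw)
log_skew = skewness(logged)
```

The raw values have a long right tail and a clearly positive skewness, while the log-transformed values are evenly spaced and symmetric. Symmetry alone does not establish normality, so a formal check would still be needed before relying on a parametric test.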
Dimensionality reduction is essential in exploratory data analy-
sis, where the purpose is to map data onto a low-dimensional space
for improved visualization. It also reduces the complexity of a prob-
lem and makes it easier to perform high-level data analysis such as
building a classifier. Dimensionality reduction can be accomplished
through feature extraction and selection, in which an optimum subset of
features derived from the input variables (e.g., genes) is selected.
Thus, feature selection methods keep only the useful features and
discard the others. Note that feature selection is distinct from variable
selection: the former constructs new features out of the original
variables and then chooses the most relevant of them, whereas the latter
chooses among the original variables themselves. One well-known
linear transformation used to reduce model dimensionality is princi-
pal component analysis (PCA). PCA transforms the input variables to
a new set of variables (features). The new variables (also known as
principal components) are computed as a linear combination of the
original variables and are orthogonal to each other. PCA reduces
input dimensionality by providing a subset of the principal components
that captures most of the information in the original data. A
classifier with the selected principal components as inputs may provide
better accuracy than a classifier whose inputs are a large number of
original variables consisting of coregulated genes.
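
The PCA description above can be sketched with a singular value decomposition of the centered data matrix. This is a minimal NumPy illustration, not a reference implementation; the function name `pca`, the synthetic "coregulated genes" driven by one hypothetical shared regulator, and the noise level are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project samples (rows of X) onto the top principal components.

    Returns (scores, components, explained_ratio), where components holds
    one orthonormal direction per row.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T                   # linear combinations of the variables
    explained = s ** 2 / np.sum(s ** 2)          # fraction of variance per component
    return scores, components, explained[:n_components]

rng = np.random.default_rng(0)
driver = rng.normal(size=(30, 1))                # hypothetical shared regulator
# Five "coregulated genes": scaled copies of the driver plus small noise.
loadings = np.array([[1.0, 0.8, 1.2, 0.9, 1.1]])
X = driver @ loadings + 0.05 * rng.normal(size=(30, 5))

scores, components, explained = pca(X, n_components=2)
```

Because the five synthetic variables move together, the first principal component accounts for nearly all of the variance, so a single derived feature could stand in for the five correlated inputs.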

1.5 High-Level Analysis

1.5.1 Clustering
Clustering is a useful exploratory technique for analysis of large-volume
high-dimensional data when there is no a priori information about
existing common properties. Clustering algorithms help discover com-
mon properties contained within the data and create groups of objects
according to their properties. These groups could be objects clustered
by sample or by gene. The former helps researchers identify which of
