Page 36 - Biosystems Engineering
Microarray Data Analysis Using Machine Learning Methods
and linearity of intensities and can be used to study and design back-
ground correction and normalization methods. The purpose of nor-
malization is to ensure that the measurements from different arrays are
comparable by compensating for systematic technical differences between
arrays. The differences can be caused by variability in the labeling,
hybridization, scanner setting, amounts of RNA, etc. By compensating
for such technical differences, a better estimate of the real biological
differences between samples can be made. The goal of most normal-
ization approaches is thus to obtain the same data distribution across
arrays. Normalization based on housekeeping genes—constitutively
expressed genes—is commonly used (Wang et al. 2002). Min–max
scaling preserves the relationships among the original data. Mean cen-
tering is more appropriate when the data contain no biases. Variance
scaling is appropriate when training data are measured with different
units. Z-score normalization is a combination of mean centering and
variance scaling, and can be very useful when there are outliers pres-
ent in the data. Linear and nonlinear fitting methods have also been
proposed. The most common ones are locally weighted linear regres-
sion (lowess) and quantile normalization methods, whose goal is to
obtain the same data distribution across arrays.
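
The scaling and quantile methods described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names and the toy matrix are hypothetical, genes are assumed to be in rows and arrays in columns, and ties are broken by position rather than by averaging tied ranks.

```python
import numpy as np

def min_max_scale(x):
    """Rescale each column (array) linearly to the range [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def z_score(x):
    """Mean-center each column, then divide by its sample standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

def quantile_normalize(x):
    """Give every column the same empirical distribution.

    The k-th smallest value of each column is replaced by the mean of the
    k-th smallest values taken across all columns.
    """
    # Row indices that would sort each column independently.
    order = np.argsort(x, axis=0, kind="stable")
    # Reference distribution: mean across columns of the sorted values.
    reference = np.sort(x, axis=0).mean(axis=1)
    out = np.empty_like(x, dtype=float)
    # Write reference[k] at the position of the k-th smallest value per column.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

# Hypothetical toy matrix: 4 genes (rows) x 3 arrays (columns).
expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.5, 6.0],
                 [4.0, 2.0, 8.0]])

scaled = min_max_scale(expr)
standardized = z_score(expr)
normalized = quantile_normalize(expr)
```

After quantile normalization every column of `normalized` holds the same set of values, while the rank of each gene within its own array is unchanged.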
Affymetrix data preprocessing involves (1) image quantification,
(2) quality control, (3) background adjustment to minimize the effect
of nonspecific binding and optical noise, (4) normalization to ensure
that the measurements from different arrays are comparable, and
(5) summarization to obtain an expression value for a probe set by
combining multiple probe intensities. The preprocessing methods used
by Affymetrix’s software, MicroArray Suite (MAS 5.0), have been shown
to be suboptimal. Li and Wong (2001) observed this and proposed an
alternative model-based expression index (MBEI). Irizarry et al.
(2003) proposed a robust multiarray analysis (RMA) method. These
two methods are based on multichip models and are implemented in
dChip and Bioconductor, respectively. The combination of RMA
background adjustment, quantile normalization, and robust multiarray
average summarization has provided better performance than MAS 5.0
and MBEI in detecting known levels of differential expression in
spike-in Affymetrix data (Irizarry et al. 2003).
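
To make the summarization step concrete, the sketch below combines the probe intensities of one probe set into a single value per array using Tukey's median polish, the robust fitting procedure commonly associated with RMA summarization. It is an illustrative sketch only, not the Bioconductor code: the function name, the iteration limit, and the tolerance are assumptions, and the input is assumed to hold log2 intensities that have already been background adjusted and normalized.

```python
import numpy as np

def median_polish_summary(log_pm, n_iter=10, tol=1e-6):
    """Summarize one probe set with median polish.

    log_pm: 2-D array, rows = probes, columns = arrays (log2 scale).
    Returns one expression value per array (overall level + array effect).
    """
    resid = np.array(log_pm, dtype=float)
    overall = 0.0
    probe_eff = np.zeros(resid.shape[0])
    array_eff = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # Sweep probe (row) medians out of the residuals.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe_eff += row_med
        # Move the median of the array effects into the overall level.
        shift = np.median(array_eff)
        array_eff -= shift
        overall += shift
        # Sweep array (column) medians out of the residuals.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        array_eff += col_med
        # Move the median of the probe effects into the overall level.
        shift = np.median(probe_eff)
        probe_eff -= shift
        overall += shift
        # Stop once a full sweep no longer changes the residuals.
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall + array_eff

# Hypothetical probe set: 5 probes measured on 3 arrays, built so that
# each intensity is exactly a probe effect plus an array effect.
probe_effects = np.array([0.5, -0.2, 0.0, 1.0, -1.0])
array_levels = np.array([7.0, 8.0, 9.5])
log_pm = probe_effects[:, None] + array_levels[None, :]

summary = median_polish_summary(log_pm)
```

Because medians are used instead of means, a single aberrant probe has little influence on the value reported for an array.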
Other low-level analyses that need to be performed prior to under-
taking a high-level analysis include handling missing values, screen-
ing outliers, data transformation, and dimensionality reduction.
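
Of these steps, missing-value handling is the simplest to illustrate. The sketch below shows three basic strategies: dropping any gene with a missing entry, filling with zero, and filling with the gene's own average. The function names and the toy matrix are hypothetical, and missing entries are assumed to be encoded as NaN, with genes in rows.

```python
import numpy as np

def drop_incomplete_genes(x):
    """Remove every gene (row) that has at least one missing value."""
    return x[~np.isnan(x).any(axis=1)]

def fill_with_zero(x):
    """Replace each missing value with zero."""
    return np.where(np.isnan(x), 0.0, x)

def fill_with_gene_mean(x):
    """Replace each missing value with the mean of that gene's observed values.

    A gene with no observed values at all stays NaN.
    """
    gene_mean = np.nanmean(x, axis=1, keepdims=True)
    return np.where(np.isnan(x), gene_mean, x)

# Hypothetical toy matrix: 3 genes (rows) x 3 arrays (columns).
expr = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, 6.0],
                 [np.nan, 8.0, 10.0]])

complete_only = drop_incomplete_genes(expr)
zero_filled = fill_with_zero(expr)
mean_filled = fill_with_gene_mean(expr)
```

Dropping genes discards data, and zero filling can distort expression profiles, which is one reason neighbor-based estimates have been studied as alternatives.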
Microarray data commonly contain missing values, yet a value is
required for each entry of a gene expression matrix. Although
self-organizing models do not suffer from this problem, missing
values are a problem for supervised methods. Several options have
been used to handle missing values, such as removing the entire
gene if it has a missing value, replacing a missing value with zero,
and replacing it with an average value. Troyanskaya et al. (2001)
reported that methods based on weighted k-nearest neighbors and