Page 36 - Biosystems Engineering

Microarray Data Analysis Using Machine Learning Methods       17

               and linearity of intensities and can be used to study and design back-
               ground correction and normalization methods. The purpose of nor-
               malization is to ensure that the measurements from different arrays are
               comparable by compensating for systematic technical differences between
               arrays. The differences can be caused by variability in the labeling,
               hybridization, scanner setting, amounts of RNA, etc. By compensating
               for such technical differences, a better estimate of the real biological
               differences between samples can be made. The goal of most normal-
               ization approaches is thus to obtain the same data distribution across
               arrays. Normalization based on housekeeping genes—constitutively
               expressed genes—is commonly used (Wang et al. 2002). Min–max
               scaling preserves the relationships among the original data. Mean cen-
               tering is more appropriate when the data contain no biases. Variance
               scaling is appropriate when training data are measured with different
               units. Z-score normalization is a combination of mean centering and
               variance scaling, and can be very useful when there are outliers pres-
               ent in the data. Linear and nonlinear fitting methods have also been
               proposed. The most common ones are locally weighted linear regres-
               sion (lowess) and quantile normalization methods, whose goal is to
               obtain the same data distribution across arrays.
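The scaling and quantile methods named above can be sketched in a few lines. This is a minimal illustration, not the implementation used by any of the cited authors: the function names are hypothetical, NumPy is assumed, the matrix layout (rows are genes, columns are arrays) is an assumption, and tied values are ranked by position rather than averaged.

```python
import numpy as np


def min_max_scale(x):
    """Rescale values linearly to the interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def z_score(x):
    """Mean-center, then divide by the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)


def quantile_normalize(matrix):
    """Give every column (array) the same empirical distribution.

    Assumed layout: rows are genes or probes, columns are arrays.
    Ties are ranked by position, which is a simplification.
    """
    matrix = np.asarray(matrix, dtype=float)
    # Sort order of each column, computed independently per array.
    order = np.argsort(matrix, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(matrix, order, axis=0)
    # Reference distribution: mean of the k-th smallest values across arrays.
    reference = sorted_vals.mean(axis=1)
    normalized = np.empty_like(matrix)
    # Write the k-th reference value where each column's k-th smallest sat.
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized
```

After `quantile_normalize`, every column contains the same set of values, so the arrays share one distribution while each gene keeps its within-array rank. Production implementations typically average the reference values over tied ranks.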
                   Affymetrix data preprocessing involves (1) image quantification,
               (2) quality control, (3) background adjustment to minimize the effect
               of nonspecific binding and optical noise, (4) normalization to ensure
               that the measurements from different arrays are comparable, and
               (5) summarization to obtain an expression value for a probe set by com-
               bining multiple probe intensities. The preprocessing methods used
               by Affymetrix’s software, MicroArray Suite (MAS 5.0), have been shown
               to be suboptimal. Li and Wong (2001) observed this and proposed an
               alternative model-based expression index (MBEI). Irizarry et al.
               (2003) proposed a robust multiarray analysis (RMA) method. These
               two methods are based on multichip models and are implemented in
               dChip and Bioconductor, respectively. The RMA pipeline, which
               combines RMA background adjustment, quantile normalization, and
               robust multiarray average summarization, has outperformed
               MAS 5.0 and MBEI in detecting known levels of differential
               expression in spike-in Affymetrix data (Irizarry et al. 2003).
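To make the summarization step concrete, the sketch below combines the probe intensities of one probe set into a single value per array using Tukey's median polish. The text above does not name the fitting procedure; median polish is how RMA summarization is commonly described, so treat that choice as an assumption here. The function name and parameters are hypothetical, and the input is assumed to be log2 intensities that have already been background-adjusted and normalized.

```python
import numpy as np


def median_polish_summary(log_intensities, max_iter=10, tol=1e-6):
    """Summarize one probe set by median polish.

    log_intensities: probes x arrays matrix of log2 intensities
    (assumed already background-adjusted and normalized).
    Returns one expression value per array: overall + array effect.
    """
    resid = np.array(log_intensities, dtype=float)
    n_probes, n_arrays = resid.shape
    overall = 0.0
    probe_eff = np.zeros(n_probes)
    array_eff = np.zeros(n_arrays)

    for _ in range(max_iter):
        # Sweep the row (probe) medians out of the residuals.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe_eff += row_med
        # Move the median of the array effects into the overall term.
        shift = np.median(array_eff)
        array_eff -= shift
        overall += shift

        # Sweep the column (array) medians out of the residuals.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        array_eff += col_med
        # Move the median of the probe effects into the overall term.
        shift = np.median(probe_eff)
        probe_eff -= shift
        overall += shift

        # Stop once a full sweep no longer changes the fit.
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break

    return overall + array_eff
```

Because medians rather than means are swept out, a single aberrant probe intensity has little influence on the per-array summary, which is the sense in which a multichip fit of this kind is robust.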
                   Other low-level analyses that need to be performed prior to under-
               taking a high-level analysis include handling missing values, screen-
               ing outliers, data transformation, and dimensionality reduction.
                   Microarray data commonly contain missing values, yet a value
               is required for each entry of a gene expression matrix.
               Although self-organizing models do not suffer from this problem,
               missing values are a problem for supervised methods. Several options
               have been used to handle missing values, such as removing any
               gene that has a missing value, replacing a missing value with zero,
               and replacing it with an average value. Troyanskaya et al. (2001)
               reported that methods based on weighted k-nearest neighbors and