Page 45 - Biosystems Engineering

26    Chapter One

               from two transcription databases: TRANSFAC (Wingender et al.
               2001) and SCPD (Zhu and Zhang 1999). Qian et al. (2003) used SVMs
               to predict the regulatory targets of 36 transcription factors in the
               S. cerevisiae genome based on the microarray expression data from
               many different physiological conditions. They assessed the perfor-
               mance of their regulatory network identifications by comparing
               them with the results from two recent genomewide ChIP–chip
               experiments. They found that the agreement between their results and
               these experiments was comparable to the agreement between the two
               experiments.
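
The three-way comparison described above can be sketched in a few lines. This is a minimal illustration only: the text does not state which agreement measure Qian et al. used, so the Jaccard index here is an assumption, and the transcription factor and gene names are hypothetical placeholders rather than data from the study.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |a & b| / |a | b| (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical (transcription factor, target gene) pairs, invented for illustration.
predicted = {("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneC"), ("TF2", "geneD")}
chip_exp1 = {("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneC"), ("TF2", "geneE")}
chip_exp2 = {("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneD"), ("TF2", "geneE")}

# Compare the predictions with each experiment, and the experiments with each other.
agreement = {
    "predicted vs experiment 1": jaccard(predicted, chip_exp1),
    "predicted vs experiment 2": jaccard(predicted, chip_exp2),
    "experiment 1 vs experiment 2": jaccard(chip_exp1, chip_exp2),
}

for name, value in agreement.items():
    print(f"{name}: {value:.2f}")
```

The experiment-versus-experiment score serves as a practical ceiling: if two independent assays overlap only partially, a computational prediction that overlaps with each of them to a similar degree is performing about as well as the reference data allow.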



          1.6 Summary
               This chapter introduces computational methods for analysis of
               microarray data including gene clustering, marker gene selection,
               prediction of phenotypic classes, and modeling of genetic networks.
               Because large-volume and high-dimensional data are being gener-
               ated by the rapidly expanding microarray technology, the number of
               reported applications of machine learning methods is expected to
               increase. With increasing demand, however, comes the need for fur-
               ther improvements that can make implementation of machine learn-
               ing algorithms in microarray data analysis more efficient. Key
               improvements include (1) enhanced computational power to handle
               high dimensionality and large-volume data; (2) improved microarray
               technology with a high-resolution scanner, low background noise,
               low technical variability, etc.; (3) enhanced quality control and
               protocols; (4) well-designed low-level analysis methods for background
               correction, cross-talk removal, normalization, outlier screening, and
               summary measures; (5) improved visualization tools to assess data
               quality and interpret results; (6) better data storage and retrieval
               mechanisms; and (7) advances in machine learning methods to
               enhance their speed and make them more accessible to the user.
                   Integration of gene expression data with genomic information (e.g.,
               how transcriptional regulators bind to promoter sequences across the
               genome) and other prior biological knowledge is one of the future
               goals of bioinformatics. It is important, however, to ensure that existing
               biological knowledge is reliable. In particular, although supervised
               machine learning methods can take advantage of prior knowledge in
               constructing a model, their success is highly dependent on the quality
               of prior knowledge from previous experiments. For example, in con-
               structing a classifier, if inaccurately labeled data are used for learning,
               the classification result will be impaired. Note that, like other empirical
               models, machine learning models are only as good as the dataset to
               which they are applied; hence, the quality of the data collected is very
               important. Thus, we believe that the use of computational methods
               alone cannot provide a solution to the complex task of microarray data