Page 45 - Biosystems Engineering
from two transcription databases: TRANSFAC (Wingender et al.
2001) and SCPD (Zhu and Zhang 1999). Qian et al. (2003) used support
vector machines (SVMs) to predict the regulatory targets of 36
transcription factors in the S. cerevisiae genome from microarray
expression data collected under many different physiological conditions.
They assessed their predicted regulatory relationships by comparing
them with the results of two recent genome-wide ChIP–chip experiments
and found that the agreement between their predictions and these
experiments was comparable to the agreement between the two
experiments themselves.
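
The kind of comparison described above can be sketched as a set-overlap calculation. The snippet below is only a minimal illustration and is not the agreement measure used by Qian et al. (2003): the function name `overlap_fraction`, the choice of the smaller set as the denominator, and the placeholder transcription factor and gene names are all assumptions made for this example.

```python
def overlap_fraction(pairs_a, pairs_b):
    """Return the fraction of the smaller set of (TF, gene) pairs
    that also appears in the other set (0.0 if either set is empty)."""
    a, b = set(pairs_a), set(pairs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical placeholder data: each item is a (transcription factor, target gene) pair.
svm_predictions = {("TF_A", "g1"), ("TF_A", "g2"), ("TF_B", "g3"), ("TF_B", "g4")}
chip_experiment_1 = {("TF_A", "g1"), ("TF_B", "g3"), ("TF_B", "g4"), ("TF_C", "g5")}
chip_experiment_2 = {("TF_A", "g1"), ("TF_A", "g2"), ("TF_B", "g4"), ("TF_C", "g5")}

# Agreement of the predictions with each experiment, and of the experiments with each other.
svm_vs_chip1 = overlap_fraction(svm_predictions, chip_experiment_1)
svm_vs_chip2 = overlap_fraction(svm_predictions, chip_experiment_2)
chip1_vs_chip2 = overlap_fraction(chip_experiment_1, chip_experiment_2)

print(svm_vs_chip1, svm_vs_chip2, chip1_vs_chip2)
```

In this toy setup, a prediction method would be judged "comparable" to the experiments when the first two values are close to the third, which serves as the experiment-to-experiment baseline.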
1.6 Summary
This chapter introduces computational methods for the analysis of
microarray data, including gene clustering, marker gene selection,
prediction of phenotypic classes, and modeling of genetic networks.
Because large-volume and high-dimensional data are being gener-
ated by the rapidly expanding microarray technology, the number of
reported applications of machine learning methods is expected to
increase. With increasing demand, however, comes the need for fur-
ther improvements that can make implementation of machine learn-
ing algorithms in microarray data analysis more efficient. Key
improvements include (1) enhanced computational power to handle
high-dimensional and large-volume data; (2) improved microarray
technology with high-resolution scanners, low background noise,
low technical variability, etc.; (3) enhanced quality control and
protocols; (4) well-designed low-level analysis methods for background
correction, cross-talk removal, normalization, outlier screening, and
summary measures; (5) improved visualization tools to assess data
quality and interpret results; (6) better data storage and retrieval
mechanisms; and (7) advances in machine learning methods to
enhance their speed and make them more accessible to the user.
Integration of gene expression data with genomic information (e.g.,
how transcriptional regulators bind to promoter sequences across the
genome) and other prior biological knowledge is one of the future
goals of bioinformatics. It is important, however, to ensure that existing
biological knowledge is reliable. In particular, although supervised
machine learning methods can take advantage of prior knowledge in
constructing a model, their success is highly dependent on the quality
of the prior knowledge from previous experiments. For example, if
inaccurately labeled data are used to train a classifier, the
classification results will be impaired. Note that, like other empirical
models, machine learning models are only as good as the data from
which they are built; hence, the quality of the data collected is very
important. Thus, we believe that the use of computational methods
alone cannot provide a solution to the complex task of microarray data