Page 114 - Building Big Data Applications
different symptoms and test results. In theory, this could lead to even earlier diag-
nosis of complex conditions.
4. Continual learning and improvement: Because AI tools get better over time, they
will also help hospitals to continually learn about the approaches that help patients
most.
Case study
Novartis Institutes for Biomedical Research (NIBR) is the global pharmaceutical
research organization for Novartis. NIBR takes a unique approach to pharmaceutical
research: at the earliest stages, it analyzes patient needs, disease specifics, and
treatment responses, which together help determine its
research priorities. On any given day, their scientists are working hard at nine
research institutes around the world to bring innovative medicines to patients.
Over 6000 scientists, physicians, and business professionals work in this open,
entrepreneurial, and innovative culture that encourages true collaboration. One of
NIBR’s many interesting drug research areas is in Next Generation Sequencing
(NGS) research. NGS research requires a lot of interaction with diverse data from
external organizations such as clinical, phenotypical, experimental, and other asso-
ciated data. Integrating all of these heterogeneous datasets is very labor intensive,
so they only want to do it once.
One of the challenges they face is that as the cost of sequencing continues to drop
exponentially, the amount of data that’s being produced increases. Because of this,
Novartis needed a highly flexible big data infrastructure so that the latest analytical
tools, techniques, and databases could be swapped into their platform with minimal
effort as NGS technologies and scientific requirements change. The Novartis team
chose the Apache Hadoop platform to investigate the data and discover its
associated relationships and complexities.
NGS produces the high data volumes that Hadoop handles well, but a common problem is
that researchers rely on many tools that do not work on native HDFS. Because these
researchers previously could not use systems like Hadoop, they had to maintain
complicated "bookkeeping" logic to parallelize work efficiently on traditional
High-Performance Computing (HPC) clusters. The team's workflow system uses Hadoop for its
performance and robustness and to provide the POSIX file access (MapR Hadoop)
that lets bioinformaticians use their familiar tools. Additionally, it uses the re-
searchers’ own metadata to allow them to write complex workflows that blend the
best aspects of Hadoop and traditional HPC. The team then uses Apache Spark to
integrate the highly diverse datasets. Their unique approach to dealing with hetero-
geneity was to represent the data as a vast knowledge graph (currently trillions of
edges) that is stored in HDFS and manipulated with custom Spark code.
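The knowledge-graph approach can be illustrated with a minimal sketch: records from heterogeneous sources are normalized into (subject, predicate, object) edge triples, after which edges from any source can be joined or traversed uniformly. The record fields and function names below are hypothetical, and plain Python stands in for the custom Apache Spark code that NIBR runs over triples stored in HDFS.

```python
# Minimal sketch of the knowledge-graph idea: heterogeneous records
# become (subject, predicate, object) edge triples. Field names are
# hypothetical; the production system stores trillions of such edges
# in HDFS and manipulates them with custom Apache Spark code.

def to_triples(record, subject_key, predicates):
    """Flatten one record into edge triples rooted at record[subject_key]."""
    subject = record[subject_key]
    return [(subject, p, record[p]) for p in predicates if p in record]

# Two heterogeneous sources describing the same gene
clinical = {"gene": "BRCA1", "variant": "c.68_69delAG", "phenotype": "HBOC"}
experiment = {"gene": "BRCA1", "assay": "NGS-panel", "coverage": "120x"}

graph = (to_triples(clinical, "gene", ["variant", "phenotype"])
         + to_triples(experiment, "gene", ["assay", "coverage"]))

# Traversal: collect all edges for one subject, regardless of source
brca1_edges = {(p, o) for s, p, o in graph if s == "BRCA1"}
```

Once everything is an edge, integrating a new dataset only requires mapping it into triples; no schema change is needed, which is one reason the representation suits data that keeps growing and changing shape.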