Page 114 - Building Big Data Applications

                different symptoms and test results. In theory, this could lead to even earlier
                diagnosis of complex conditions.
             4. Continual learning and improvement: Because AI tools get better over time, they
                will also help hospitals to continually learn about the approaches that help patients
                most.

             Case study


                Novartis Institutes for Biomedical Research (NIBR) is the global pharmaceutical
                research organization for Novartis. NIBR takes a unique approach to pharmaceutical
                research: at the earliest stages, it analyzes patient needs, disease specifics,
                and responses, and uses that understanding to set its research priorities. On any
                given day, its scientists work at nine research institutes around the world to
                bring innovative medicines to patients. Over 6000 scientists, physicians, and
                business professionals work in this open, entrepreneurial, and innovative culture
                that encourages true collaboration. One of NIBR’s many interesting drug research
                areas is Next Generation Sequencing (NGS). NGS research requires extensive
                interaction with diverse data from external organizations, including clinical,
                phenotypical, experimental, and other associated data. Integrating all of these
                heterogeneous datasets is very labor intensive, so the team wants to do it only
                once.

                One of the challenges the team faces is that as the cost of sequencing continues
                to drop exponentially, the amount of data being produced increases. Because of
                this, Novartis needed a highly flexible big data infrastructure so that the latest
                analytical tools, techniques, and databases could be swapped into its platform
                with minimal effort as NGS technologies and scientific requirements change. The
                Novartis team chose the Apache Hadoop platform for investigating and discovering
                data and its associated relationships and complexities.

                NGS produces high data volumes that are well suited to Hadoop, but a common
                problem is that researchers rely on many tools that do not work on native HDFS.
                Because these researchers previously could not use systems like Hadoop, they had
                to maintain complicated "bookkeeping" logic to parallelize work efficiently on
                traditional High-Performance Computing (HPC) systems. The Novartis workflow system
                uses Hadoop for its performance and robustness and to provide the POSIX file
                access (MapR Hadoop) that lets bioinformaticians use their familiar tools.
                Additionally, it uses the researchers’ own metadata to allow them to write complex
                workflows that blend the best aspects of Hadoop and traditional HPC. The team then
                uses Apache Spark to integrate the highly diverse datasets. Its approach to
                dealing with heterogeneity is to represent the data as a vast knowledge graph
                (currently trillions of edges) that is stored in HDFS and manipulated with custom
                Spark code.
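
                As a rough illustration of the knowledge-graph idea described above, the
                following sketch flattens two differently shaped record sets into
                (subject, predicate, object) edges and answers a question that spans both
                sources. It is plain Python on a few in-memory records, not the team's Spark
                code; the record fields, the identifiers, and the helper names (to_edges,
                variants_for_condition) are all hypothetical and chosen only for illustration.

```python
# Hypothetical records from two heterogeneous sources (all values invented).
clinical = [
    {"patient": "P1", "condition": "condition_x"},
    {"patient": "P2", "condition": "condition_y"},
]
sequencing = [
    {"sample": "S1", "patient": "P1", "variant": "VAR_A"},
    {"sample": "S2", "patient": "P2", "variant": "VAR_B"},
    {"sample": "S3", "patient": "P1", "variant": "VAR_C"},
]


def to_edges(records, subject_key):
    """Turn each record into (subject, predicate, object) edges.

    The field named by subject_key becomes the subject; every other
    field becomes one edge whose predicate is the field name.
    """
    edges = []
    for record in records:
        subject = record[subject_key]
        for key, value in record.items():
            if key != subject_key:
                edges.append((subject, key, value))
    return edges


def variants_for_condition(edges, condition):
    """Follow condition -> patient -> sample -> variant through the edge list."""
    patients = {s for s, p, o in edges if p == "condition" and o == condition}
    samples = {s for s, p, o in edges if p == "patient" and o in patients}
    return sorted(o for s, p, o in edges if p == "variant" and s in samples)


# Once both sources are edges, they share one shape and can simply be concatenated.
edges = to_edges(clinical, "patient") + to_edges(sequencing, "sample")
result = variants_for_condition(edges, "condition_x")
print(result)
```

                The point of the edge-list shape is that a new source only needs its own
                to_edges call; queries that traverse the graph do not change. At the scale the
                text mentions, the same triples would presumably live in a distributed
                collection rather than a Python list.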