how to align outcomes and readings if the actual dosage information was not captured, was missing, or was incorrectly stated. This type of situation calls for early data discovery and exploration, and that is exactly what the big data platform provides. We can use interactive query applications such as Tableau, Apache Drill, or Presto to run these exercises. The preferred data format is JSON, which gives us the flexibility to add more values as needed or to correct values as data is ingested and explored.
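As a minimal sketch of this flexibility, the following Python snippet parses JSON readings in which the dosage field may be absent or mis-keyed and annotates each record during exploration. The field names and values are hypothetical, chosen to mirror the dosage example above.

```python
import json

# Hypothetical raw patient readings; field names are illustrative only.
raw_records = [
    '{"patient_id": "P001", "reading": 7.2, "dosage_mg": 50}',
    '{"patient_id": "P002", "reading": 6.8}',                   # dosage missing
    '{"patient_id": "P003", "reading": 9.1, "dosage_mg": "5O"}' # mis-keyed value
]

def explore(record_json):
    """Parse a JSON record and flag missing or suspect dosage values."""
    record = json.loads(record_json)
    dosage = record.get("dosage_mg")
    if dosage is None:
        record["dosage_status"] = "missing"
    elif not isinstance(dosage, (int, float)):
        record["dosage_status"] = "suspect"  # e.g. "5O" typed instead of 50
    else:
        record["dosage_status"] = "ok"
    return record

for line in raw_records:
    print(explore(line))
```

Because JSON records are self-describing, the exploration pass can attach a new field such as the status flag above without touching any upstream schema.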
Once data discovery and exploration are complete, we are at a stage where we can attribute and define the data. In the world of applications, data attribution and definition need to start at the raw data layer and continue through each stage of transformation the data goes through. Attribution includes identifying the data, its metadata, its format, and all values associated with it in the samples or in the actual data received. The attribution process is important for structuring new data, which can come from any source and arrive in any format. Attribution delivers the metadata and defines the formats.
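A minimal sketch of this attribution step, assuming JSON-like records already parsed into dictionaries: it collects every field seen, the types observed for it, and a sample value, which is the kind of metadata-and-format profile described above. The field names are illustrative.

```python
from collections import defaultdict

def attribute(records):
    """Build a simple metadata profile of raw records: every field seen,
    the value types observed for it, and one sample value."""
    profile = defaultdict(lambda: {"types": set(), "sample": None})
    for record in records:
        for field, value in record.items():
            entry = profile[field]
            entry["types"].add(type(value).__name__)
            if entry["sample"] is None:
                entry["sample"] = value
    return dict(profile)

records = [
    {"patient_id": "P001", "reading": 7.2, "dosage_mg": 50},
    {"patient_id": "P002", "reading": 6.8, "unit": "mmol/L"},
]
for field, meta in attribute(records).items():
    print(field, sorted(meta["types"]), meta["sample"])
```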
The next step, identifying and defining complexity, is completed by analyzing the data after the attribution process. The complexity in question is the incompleteness of the data and the missing data that must be characterized in the analysis. Understanding the dependency on these aspects of the data is important from both the analytics and the application usage perspectives. This complexity haunts even the best designed data warehouse and analytics platforms. In the big data application world, the underlying infrastructure gives us the opportunity to examine the raw data and fix these issues. This complexity needs to be managed and documented.
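One way to surface and document this incompleteness is a simple completeness profile over the raw records. The sketch below, using hypothetical records, reports what fraction of records actually carries each field.

```python
def completeness(records):
    """Report the fraction of records carrying a non-empty value for
    each field, exposing the incompleteness the analysis must document."""
    fields = {f for r in records for f in r}
    total = len(records)
    return {f: sum(1 for r in records if r.get(f) is not None) / total
            for f in fields}

records = [
    {"patient_id": "P001", "dosage_mg": 50},
    {"patient_id": "P002"},                     # dosage never captured
    {"patient_id": "P003", "dosage_mg": None},  # captured but empty
]
for field, ratio in sorted(completeness(records).items()):
    print(f"{field}: {ratio:.0%} complete")
```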
Another complexity that needs to be defined and managed arises when data from multiple sources collides and causes issues. The business teams' rules are applied once the data has been defined and the conflict needs to be sorted out, but what about the raw data? In the world of big data applications, this issue must be handled at the raw data level too, since data discovery, exploration, and operational analytics are all executed in the raw data layer. To manage this complexity, we tag each source row, file, or document. Conflicting rows, files, or documents are then available for inspection, and users can determine what they want to do with the data and recommend the rules for its ingestion.
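A minimal sketch of this tagging-and-inspection idea, assuming dictionary records and an illustrative patient_id business key: each record is stamped with its source, and records from different sources that disagree on a field are surfaced for the users to rule on.

```python
def tag(records, source):
    """Stamp every raw record with its origin so colliding sources
    can be traced and inspected later."""
    return [dict(r, _source=source) for r in records]

def find_conflicts(tagged, key, field):
    """Group records by a business key and collect pairs whose values
    for `field` disagree; users decide the ingestion rule for these."""
    seen = {}
    conflicts = []
    for r in tagged:
        k = r[key]
        if k in seen and seen[k][field] != r[field]:
            conflicts.append((k, seen[k], r))
        else:
            seen[k] = r
    return conflicts

lab = tag([{"patient_id": "P001", "dosage_mg": 50}], "lab_feed")
emr = tag([{"patient_id": "P001", "dosage_mg": 75}], "emr_feed")
for key, a, b in find_conflicts(lab + emr, "patient_id", "dosage_mg"):
    print(f"conflict on {key}: {a['_source']}={a['dosage_mg']} "
          f"vs {b['_source']}={b['dosage_mg']}")
```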
The next step in managing complexity is to classify and segment the data. This process classifies the data by type, format, and complexity, and segments the data, files, and associated documents into separate directories so they can be managed from the raw data layer all the way to the application layer.
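The directory segmentation could be sketched as below. The classification by file extension and the /data/raw and /data/managed paths are assumptions for illustration; a fuller classifier would also draw on the attribution and source tags produced earlier.

```python
import shutil
from pathlib import Path

def segment(raw_dir, managed_dir):
    """Route raw files into per-format subdirectories so each class
    of data can be managed separately from the raw layer onward."""
    raw, managed = Path(raw_dir), Path(managed_dir)
    for f in raw.iterdir():
        if not f.is_file():
            continue
        # Classify by extension here; a real classifier could also
        # inspect content, size, or the source tags added earlier.
        kind = f.suffix.lstrip(".").lower() or "unknown"
        target = managed / kind
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, target / f.name)

# segment("/data/raw", "/data/managed")  # hypothetical invocation
```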


             Complexities in transformation of data

There are several states of transformation we will apply to the data from the time it is available until it is used in an application. These transformations will carry several