how to align outcomes and readings if the actual dosage information was not captured, was missing, or was incorrectly stated. This type of situation calls for early data discovery or exploration, and that is exactly what the big data platform provides us. We can use interactive query tools such as Tableau, Apache Drill, or Presto to run these exercises. The preferred format for the data is JSON, which gives us the flexibility to add more values as needed, or to correct values, as the data is ingested and explored.
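As a concrete illustration, here is a minimal Python sketch of such a discovery pass over JSON records; the patient_id, dosage, and dosage_unit fields are hypothetical stand-ins for the clinical example above, not a prescribed schema.

import json
from collections import Counter

# Two illustrative readings; dosage and dosage_unit are hypothetical fields.
sample = [
    '{"patient_id": 1, "dosage": 50, "dosage_unit": "mg"}',
    '{"patient_id": 2}',  # dosage never captured at the source
]

total, missing = 0, 0
units = Counter()
for line in sample:
    record = json.loads(line)  # JSON tolerates absent fields
    total += 1
    dosage = record.get("dosage")
    if dosage in (None, ""):
        missing += 1           # candidate for correction on ingest
    else:
        units[record.get("dosage_unit", "unknown")] += 1

print(f"{missing}/{total} records lack dosage; units seen: {dict(units)}")

Because JSON carries no fixed schema, the absent dosage simply shows up in the profile rather than breaking the load.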
Once we have completed data discovery and exploration, we are at the stage where we can attribute and define the data. In the world of applications, data attribution and definition need to start at the raw data layer and continue through each stage of transformation as the data evolves. The attribution of data includes the identification of the data, its metadata, its format, and all values associated with the data in the samples or in the actual data received. The attribution process is important for structuring new data, which can come from any source and arrive in any format. The attribution delivers the metadata and defines the formats.
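One way to picture the attribution pass is a routine that records, for every field seen in the raw data, the types it takes and a sample value. The sketch below assumes JSON input and illustrative records; it is not a full attribution engine.

import json

def attribute(records):
    # Record, per field, the Python types observed and one sample value.
    metadata = {}
    for record in records:
        for field, value in record.items():
            entry = metadata.setdefault(field, {"types": set(), "sample": value})
            entry["types"].add(type(value).__name__)
    return metadata

raw = [json.loads(s) for s in (
    '{"patient_id": 1, "dosage": 50, "unit": "mg"}',
    '{"patient_id": 2, "dosage": "unknown"}',  # same field, different type
)]
for field, meta in attribute(raw).items():
    print(field, sorted(meta["types"]), "sample:", meta["sample"])

A field that arrives with more than one type, as dosage does here, is exactly the kind of finding the attribution step should carry forward into the format definitions.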
The next step, identifying and defining complexity, is completed by analyzing the data after the attribution process. The complexity in question is the incompleteness of the data and the missing values, both of which need to be defined in the analysis. Understanding the dependency on these aspects of the data is very important from both the analytics and the application-usage perspectives. This complexity haunts even the best-designed data warehouse and analytics platforms. In the big data application world, the underlying infrastructure gives us the opportunity to examine the raw data and fix these issues. This complexity needs to be managed and documented.
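A minimal sketch of such an incompleteness analysis, assuming dict-shaped raw records and a hypothetical list of expected fields, might look like this:

def completeness(records, expected_fields):
    # Count, per expected field, how many records are missing it or
    # carry an empty value; this is the incompleteness to document.
    counts = {field: 0 for field in expected_fields}
    for record in records:
        for field in expected_fields:
            if record.get(field) in (None, ""):
                counts[field] += 1
    return {f: f"{n}/{len(records)} missing" for f, n in counts.items()}

records = [
    {"patient_id": 1, "dosage": 50},
    {"patient_id": 2},                # dosage never captured
    {"patient_id": 3, "dosage": ""},  # captured but empty
]
print(completeness(records, ["patient_id", "dosage"]))

The output of a pass like this is what gets managed and documented, so that downstream analytics know which dependencies are at risk.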
Another complexity that needs to be defined and managed arises when source data from multiple sources collides and causes issues. The business teams' rules are applied once the data has been defined and the conflict sorted out, but what about the raw data? In the world of big data applications this issue needs to be handled at the raw data level too, since data discovery, data exploration, and operational analytics are all executed in the raw data layer. To manage this complexity, we tag each source row, file, or document. The rows, files, or documents that conflict remain available for inspection, and users can determine what they want to do with the data and recommend the rules for its ingestion.
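The sketch below shows one way such row-level tagging and conflict surfacing could look; the feed names, the patient_id key, and the dosage field are hypothetical.

from collections import defaultdict

def surface_conflicts(batches):
    # Tag each raw record with its source, group by a business key, and
    # return the keys whose sources disagree so users can inspect them.
    by_key = defaultdict(list)
    for source, records in batches.items():
        for record in records:
            by_key[record["patient_id"]].append(dict(record, _source=source))
    return {key: rows for key, rows in by_key.items()
            if len({row.get("dosage") for row in rows}) > 1}

batches = {
    "clinic_feed": [{"patient_id": 1, "dosage": 50}],
    "lab_feed":    [{"patient_id": 1, "dosage": 75}],  # collides with clinic
}
for key, rows in surface_conflicts(batches).items():
    print("inspect patient", key, rows)

Because every record carries its source tag, the conflicting versions stay side by side in the raw layer instead of being silently merged, which is what makes the inspection and rule-making possible.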
The next step in managing complexity is to classify and segment the data. This process classifies the data by type, format, and complexity, and segments the data, files, and associated documents into separate directories so that the data can be managed from the raw data layer all the way to the application layer.
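As an illustration, a format-based segmentation step could be sketched as follows; the raw and managed directory names and the extension-based classification are assumptions for the example, not a prescribed layout.

import shutil
from pathlib import Path

def segment(raw_dir, managed_dir):
    # Classify each raw file by extension and copy it into a per-format
    # directory under the managed layer; unknown formats are parked for
    # manual inspection rather than dropped.
    for path in Path(raw_dir).iterdir():
        if not path.is_file():
            continue
        if path.suffix == ".json":
            kind = "json"
        elif path.suffix in (".csv", ".tsv"):
            kind = "delimited"
        else:
            kind = "unclassified"
        target = Path(managed_dir) / kind
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, target / path.name)

segment("raw", "managed")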
Complexities in transformation of data
There are several states of transformation we will apply to the data from the time it is available until it is used in an application. These transformations will carry several