how to align outcomes and readings if the actual dosage information was not captured, was missing, or was incorrectly stated. This type of situation calls for early data discovery or exploration, and that is exactly what the big data platform provides us. We can use interactive query tools such as Tableau, Apache Drill, or Presto to run these exercises. The preferred format for the data is JSON, which gives us the flexibility to add more values as needed, or to correct values, as the data is ingested and explored.
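As a concrete illustration, here is a minimal Python sketch of such a discovery pass over JSON records; the patient_id, dosage, and dosage_unit fields are hypothetical stand-ins for the clinical example above, not a prescribed schema.

import json
from collections import Counter

# Two illustrative readings; dosage and dosage_unit are hypothetical fields.
sample = [
    '{"patient_id": 1, "dosage": 50, "dosage_unit": "mg"}',
    '{"patient_id": 2}',  # dosage never captured at the source
]

total, missing = 0, 0
units = Counter()
for line in sample:
    record = json.loads(line)  # JSON tolerates absent fields
    total += 1
    dosage = record.get("dosage")
    if dosage in (None, ""):
        missing += 1           # candidate for correction on ingest
    else:
        units[record.get("dosage_unit", "unknown")] += 1

print(f"{missing}/{total} records lack dosage; units seen: {dict(units)}")

Because JSON carries no fixed schema, the absent dosage simply shows up in the profile rather than breaking the load.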
Once we have completed data discovery and exploration, we are at the stage where we can attribute and define the data. In the world of applications, data attribution and definition need to start at the raw data layer and continue through each stage of transformation as the data evolves. The attribution of data includes the identification of the data, its metadata, its format, and all values associated with the data in the samples or in the actual data received. The attribution process is important for structuring new data, which can come from any source and arrive in any format. The attribution delivers the metadata and defines the formats.
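One way to picture the attribution pass is a routine that records, for every field seen in the raw data, the types it takes and a sample value. The sketch below assumes JSON input and illustrative records; it is not a full attribution engine.

import json

def attribute(records):
    # Record, per field, the Python types observed and one sample value.
    metadata = {}
    for record in records:
        for field, value in record.items():
            entry = metadata.setdefault(field, {"types": set(), "sample": value})
            entry["types"].add(type(value).__name__)
    return metadata

raw = [json.loads(s) for s in (
    '{"patient_id": 1, "dosage": 50, "unit": "mg"}',
    '{"patient_id": 2, "dosage": "unknown"}',  # same field, different type
)]
for field, meta in attribute(raw).items():
    print(field, sorted(meta["types"]), "sample:", meta["sample"])

A field that arrives with more than one type, as dosage does here, is exactly the kind of finding the attribution step should carry forward into the format definitions.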
The next step, identifying and defining complexity, is completed by analyzing the data after the attribution process. The complexity in question is the incompleteness of the data and the missing values, both of which need to be defined in the analysis. Understanding the dependency on these aspects of the data is very important from both the analytics and the application-usage perspectives. This complexity haunts even the best-designed data warehouse and analytics platforms. In the big data application world, the underlying infrastructure gives us the opportunity to examine the raw data and fix these issues. This complexity needs to be managed and documented.
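A minimal sketch of such an incompleteness analysis, assuming dict-shaped raw records and a hypothetical list of expected fields, might look like this:

def completeness(records, expected_fields):
    # Count, per expected field, how many records are missing it or
    # carry an empty value; this is the incompleteness to document.
    counts = {field: 0 for field in expected_fields}
    for record in records:
        for field in expected_fields:
            if record.get(field) in (None, ""):
                counts[field] += 1
    return {f: f"{n}/{len(records)} missing" for f, n in counts.items()}

records = [
    {"patient_id": 1, "dosage": 50},
    {"patient_id": 2},                # dosage never captured
    {"patient_id": 3, "dosage": ""},  # captured but empty
]
print(completeness(records, ["patient_id", "dosage"]))

The output of a pass like this is what gets managed and documented, so that downstream analytics know which dependencies are at risk.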
Another complexity that needs to be defined and managed arises when source data from multiple sources collides and causes issues. The business teams' rules are applied once the data has been defined and the conflict sorted out, but what about the raw data? In the world of big data applications this issue needs to be handled at the raw data level too, since data discovery, data exploration, and operational analytics are all executed in the raw data layer. To manage this complexity, we tag each source row, file, or document. The rows, files, or documents that conflict remain available for inspection, and users can determine what they want to do with the data and recommend the rules for its ingestion.
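The sketch below shows one way such row-level tagging and conflict surfacing could look; the feed names, the patient_id key, and the dosage field are hypothetical.

from collections import defaultdict

def surface_conflicts(batches):
    # Tag each raw record with its source, group by a business key, and
    # return the keys whose sources disagree so users can inspect them.
    by_key = defaultdict(list)
    for source, records in batches.items():
        for record in records:
            by_key[record["patient_id"]].append(dict(record, _source=source))
    return {key: rows for key, rows in by_key.items()
            if len({row.get("dosage") for row in rows}) > 1}

batches = {
    "clinic_feed": [{"patient_id": 1, "dosage": 50}],
    "lab_feed":    [{"patient_id": 1, "dosage": 75}],  # collides with clinic
}
for key, rows in surface_conflicts(batches).items():
    print("inspect patient", key, rows)

Because every record carries its source tag, the conflicting versions stay side by side in the raw layer instead of being silently merged, which is what makes the inspection and rule-making possible.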
The next step in managing complexity is to classify and segment the data. This process classifies the data by type, format, and complexity, and segments the data, files, and associated documents into separate directories so that the data can be managed from the raw data layer all the way to the application layer.
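As an illustration, a format-based segmentation step could be sketched as follows; the raw and managed directory names and the extension-based classification are assumptions for the example, not a prescribed layout.

import shutil
from pathlib import Path

def segment(raw_dir, managed_dir):
    # Classify each raw file by extension and copy it into a per-format
    # directory under the managed layer; unknown formats are parked for
    # manual inspection rather than dropped.
    for path in Path(raw_dir).iterdir():
        if not path.is_file():
            continue
        if path.suffix == ".json":
            kind = "json"
        elif path.suffix in (".csv", ".tsv"):
            kind = "delimited"
        else:
            kind = "unclassified"
        target = Path(managed_dir) / kind
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, target / path.name)

segment("raw", "managed")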
Complexities in transformation of data
There are several states of transformation we will apply to the data from the time it is available until it is used in an application. These transformations will carry several