Chapter 5 Pharmacy industry applications and usage
incoming events up into small batches either by arrival time or until a batch has reached
a certain size. This reduces the computational cost of processing but also introduces
more latency. To determine whether streaming is warranted, ask if the value of the data de-
creases over time. If the answer is yes, then you might be looking at a case where real-
time processing and analysis is beneficial.
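The batching trade-off described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it groups a stream of events into batches, emitting a batch when either a maximum size or a maximum wait time is reached, so smaller limits mean lower latency but more per-batch overhead.

```python
import time
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable, max_size: int = 100,
                  max_wait_s: float = 1.0,
                  clock=time.monotonic) -> Iterator[List]:
    """Group a stream of events into small batches, emitting a batch
    when it reaches max_size or when max_wait_s has elapsed."""
    batch, started = [], clock()
    for event in events:
        batch.append(event)
        if len(batch) >= max_size or clock() - started >= max_wait_s:
            yield batch
            batch, started = [], clock()
    if batch:  # flush any trailing partial batch
        yield batch
```

Tuning `max_size` and `max_wait_s` is exactly the cost-versus-latency decision discussed above.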
An exciting application of streaming data is machine learning, a hot subject area
across big data platforms and cloud computing today. One benefit of streaming data is
the ability to retrain the algorithms used in machine learning continuously: as new data
becomes available in streams, the models can incorporate it and keep learning rather
than waiting for periodic batch retraining. The patterns and attributes of the data that
are collected and stored as processing occurs feed the data attribution and definition
process. The data collected will also carry associated metadata, which is useful in
defining the architecture of the table and the associated file in the data lake, and in
further application areas.
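The idea of a model that updates itself as data streams in can be illustrated with a deliberately tiny example. This is a sketch only: the "model" is just a running mean maintained incrementally, standing in for the kind of online update a real library (one with partial-fit support) would perform per batch.

```python
class StreamingMeanModel:
    """Toy online model: its 'prediction' is the running mean of the
    signal, updated incrementally as each observation streams in."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> None:
        # Incremental update: no need to revisit historical data,
        # which is the key property for learning on streams.
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def predict(self) -> float:
        return self.mean
```

The important property is that `update` touches only the newest observation, so the model keeps learning at streaming rates without reprocessing the full history.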
Another approach to understanding the complexity of the data is to execute a data
discovery process. In this step, the data is acquired and stored as raw data sets. The
acquisition process, if not streaming, can use Apache Kafka or Apache NiFi to ingest
the data and define its structure and layout. Understanding this is essential in data dis-
covery, as we will run through several iterations of analyzing the data received, the time
of its receipt, the business value of the data, and the insights that can be delivered. Data
discovery is an interesting exercise because you need to know what you are looking for.
The technique is useful in reading log files, process outcomes, and decision support
mechanisms; in research it is excellent for identifying the steps and outcomes of any
experiment. The benefit of this approach is that it lets the end user uncover the
complexity of the data and assess its usefulness. We will do data discovery even if we
execute streaming data analysis; it is a must-do step for realizing the overall structure,
attribution, formats, value, and insights possible from the data acquired.
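A first pass over raw data sets during discovery often amounts to profiling each field. The following is a minimal sketch, assuming records arrive as dictionaries (as they might after parsing a feed); it tallies, per field, how often the field appears, how often it is missing, and which value types are observed.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def profile_records(records: Iterable[Mapping[str, Any]]) -> Dict[str, dict]:
    """First-pass data discovery: for each field, count occurrences,
    missing values, and the value types observed."""
    stats: Dict[str, dict] = defaultdict(
        lambda: {"count": 0, "missing": 0, "types": Counter()})
    for rec in records:
        for field, value in rec.items():
            s = stats[field]
            s["count"] += 1
            if value is None or value == "":
                s["missing"] += 1
            else:
                s["types"][type(value).__name__] += 1
    return dict(stats)
```

Output like this gives the iteration loop described above something concrete to discuss: which fields are reliably populated, which carry mixed types, and which need attention from the source system.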
Another key step, once the data discovery process is complete, is data exploration. In this
process we look through the data to see where it connects with other data. The
exploration of data is essential to determine not only these connection possibilities,
but also whether the data has issues, including but not limited to the following:
Duplicate information
Missing data
Random variables
Format issues
Incorrect rounding or float termination
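The checks above can be sketched in code. This is a minimal illustration, assuming records are dictionaries; the field names used in the example (`id`, `dose`) are hypothetical. It flags two of the listed issues, duplicate records (same key fields) and missing required values; format and rounding checks would follow the same pattern.

```python
from typing import Dict, List, Mapping


def quality_issues(records: List[Mapping], key_fields: List[str],
                   required: List[str]) -> Dict[str, list]:
    """Flag common exploration issues: duplicate records (same key
    fields) and missing required values, by record index."""
    issues = {"duplicates": [], "missing": []}
    seen = set()
    for i, rec in enumerate(records):
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            issues["duplicates"].append(i)
        seen.add(key)
        for f in required:
            if rec.get(f) in (None, ""):
                issues["missing"].append((i, f))
    return issues
```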
These issues can be discovered during data exploration and flagged for evaluation,
and source systems can be notified if they need to be fixed before the next feed of
data from the same system. An interesting example is the use of drugs in the treatment of
patients: let us assume the same drug, Benadryl, is given to both pediatric and adult
patients. The system feeds for patient vitals and drugs administered are electronic, and in
this situation, if the dosage is not identified as pediatric or adult, it will be confusing on