Page 107 - Building Big Data Applications

Chapter 5   Pharmacy industry applications and usage  103


                 incoming events up into small batches either by arrival time or until a batch has reached
                 a certain size. This reduces the computational cost of processing but also introduces
                 more latency. To decide whether streaming is warranted, ask whether the value of the
                 data decreases over time. If the answer is yes, then you might be looking at a case where
                 real-time processing and analysis are beneficial.
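
                   The batching trade-off described above can be sketched as follows. This is a minimal
                 illustration rather than any particular platform's implementation: the function name
                 micro_batches, the (arrival time, payload) event shape, and the max_size and max_wait
                 parameters are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (arrival time in seconds, payload).
Event = Tuple[float, str]


def micro_batches(
    events: Iterable[Event], max_size: int, max_wait: float
) -> Iterator[List[Event]]:
    """Group events into batches closed by size or by arrival-time window."""
    batch: List[Event] = []
    for event in events:
        arrival, _ = event
        # Close the current batch if this event falls outside its time window.
        if batch and arrival - batch[0][0] >= max_wait:
            yield batch
            batch = []
        batch.append(event)
        # Close the batch as soon as it reaches the size limit.
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush whatever is left when the stream ends


# Made-up sample events for illustration only.
events = [(0.0, "a"), (0.2, "b"), (0.4, "c"), (1.5, "d"), (1.6, "e")]
batches = list(micro_batches(events, max_size=3, max_wait=1.0))
```

                   Larger values of max_size or max_wait mean fewer, bigger batches and lower processing
                 cost, at the price of each event waiting longer before it is handled.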
                   An exciting application of streaming data is in the field of machine learning, which is a
                 hot subject area across big data platforms and cloud computing today. One of the benefits
                 of streaming data is the ability to retrain the algorithms used in machine learning
                 continuously: in unsupervised learning, for example, the algorithm keeps learning as new
                 data becomes available in the stream. The patterns and identities of data that are collected
                 and stored as processing occurs are then used in the data attribution and definition process.
                 The data collected will also contain associated metadata, which is useful in defining the
                 architecture of the table and the associated file in the data lake, and in further
                 application areas.
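
                   One way to picture an algorithm that keeps learning from a stream is sequential (online)
                 k-means, where each arriving point nudges its nearest centroid. This technique is chosen
                 here only as an illustration and is not named in the text; the class name SequentialKMeans,
                 the seed centroids, and the sample values are assumptions.

```python
from typing import List


class SequentialKMeans:
    """Sequential (online) k-means: centroids move as each point arrives."""

    def __init__(self, initial_centroids: List[List[float]]) -> None:
        self.centroids = [list(c) for c in initial_centroids]
        # Each seed centroid is treated as one already-observed point.
        self.counts = [1] * len(self.centroids)

    def update(self, point: List[float]) -> int:
        """Assign the point to its nearest centroid and update that centroid."""
        # Squared Euclidean distance to every centroid.
        distances = [
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in self.centroids
        ]
        k = distances.index(min(distances))
        # Running-mean step: the centroid becomes the mean of its points so far.
        self.counts[k] += 1
        step = 1.0 / self.counts[k]
        self.centroids[k] = [
            c + step * (p - c) for p, c in zip(point, self.centroids[k])
        ]
        return k


# Made-up one-dimensional readings arriving one at a time.
model = SequentialKMeans([[0.0], [10.0]])
labels = [model.update([x]) for x in [1.0, 9.0, 2.0, 11.0]]
```

                   No stored history is needed: the model state is just the centroids and their counts,
                 which is what makes this kind of update practical on a stream.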
                   Another approach to understanding the complexity of the data is to execute a data
                 discovery process. In this step, the data is acquired and stored as raw data sets. The
                 acquisition process, if not streaming, can use Apache Kafka or Apache NiFi processes to
                 define the structure and layout of the data. Understanding this is essential in data
                 discovery, as we will run through several iterations of analysis of the data received, the
                 time of its receipt, the business value of the data, and the insights that can be delivered.
                 Data discovery is an interesting exercise, as you need to know what you are looking for. This
                 technique is useful in reading log files, process outcomes, and decision support
                 mechanisms. In research, the technique is well suited to identifying the steps and outcomes
                 of experiments. The benefit of this approach is that it lets the end user determine the
                 complexity of the data and its usefulness. We will do data discovery even if we execute
                 streaming data analysis; it is a must-do step for realizing the overall structure,
                 attribution, formats, value, and insights possible from the data acquired.
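
                   A first discovery pass over raw data might look like the sketch below, which reports
                 which attributes appear and in what types. It assumes the raw data set is stored as
                 JSON lines; the function name profile_records and the field names patient_id, drug, and
                 dose_mg are hypothetical, and the sample records are made up.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable


def profile_records(lines: Iterable[str]) -> Dict[str, Dict[str, object]]:
    """Summarize which attributes appear in raw JSON lines and with what types."""
    seen = 0
    present: Counter = Counter()
    types = defaultdict(Counter)
    for line in lines:
        record = json.loads(line)
        seen += 1
        for name, value in record.items():
            # Record the Python type observed for this attribute.
            types[name][type(value).__name__] += 1
            if value is not None:
                present[name] += 1
    return {
        name: {
            # Share of records where the attribute has a non-null value.
            "fill_rate": present[name] / seen,
            "types": dict(counts),
        }
        for name, counts in types.items()
    }


# Made-up raw records for illustration only.
raw = [
    '{"patient_id": "p1", "drug": "Benadryl", "dose_mg": 25}',
    '{"patient_id": "p2", "drug": "Benadryl", "dose_mg": "12.5"}',
    '{"patient_id": "p3", "drug": null}',
]
profile = profile_records(raw)
```

                   In this sample, dose_mg shows up both as a number and as a string, which is the kind of
                 structural finding a discovery iteration is meant to surface.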
                   Another key step after the data discovery process is data exploration. In this process
                 we look through the data to see where it connects with other data. Exploration is
                 essential not only to determine these possible connections, but also to see whether the
                 data has issues, including but not limited to the following:
                   Duplicate information
                   Missing data
                   Random variables
                   Format issues
                   Incorrect rounding or float termination
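
                   Some of the checks listed above can be sketched as a simple pass over the records.
                 This is an illustrative example only: the function name explore, the field names, and
                 the expected date format (YYYY-MM-DD) are assumptions, and the sample rows are made up.
                 It covers duplicate, missing, and format checks, not every issue in the list.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Assumed expected date format for this example: YYYY-MM-DD.
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def explore(
    records: List[Record], required: List[str], date_field: str
) -> Dict[str, List[int]]:
    """Return row indexes flagged as duplicate, missing data, or badly formatted."""
    # A row is a duplicate when an identical row appears more than once.
    row_counts = Counter(tuple(sorted(r.items())) for r in records)
    issues: Dict[str, List[int]] = {"duplicate": [], "missing": [], "format": []}
    for i, r in enumerate(records):
        if row_counts[tuple(sorted(r.items()))] > 1:
            issues["duplicate"].append(i)
        # Missing data: a required field is absent, null, or empty.
        if any(r.get(f) in (None, "") for f in required):
            issues["missing"].append(i)
        # Format issue: the date is present but not in the expected format.
        value = r.get(date_field)
        if value and not DATE_FORMAT.match(value):
            issues["format"].append(i)
    return issues


# Made-up rows for illustration only.
rows = [
    {"patient_id": "p1", "drug": "Benadryl", "given_on": "2020-01-05"},
    {"patient_id": "p1", "drug": "Benadryl", "given_on": "2020-01-05"},
    {"patient_id": "p2", "drug": None, "given_on": "2020-01-06"},
    {"patient_id": "p3", "drug": "Benadryl", "given_on": "01/07/2020"},
]
issues = explore(rows, required=["patient_id", "drug"], date_field="given_on")
```

                   Returning row indexes rather than dropping rows keeps the raw data intact, so the
                 flagged records can be evaluated and reported back to the source system.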

                   These issues can be discovered during data exploration and flagged for evaluation, and
                 the source systems can be notified if the issues need to be fixed before the next feed of
                 data from the same system. An interesting example is the usage of drugs for treatment of
                 patients; let us assume the same Benadryl is being given to pediatric patients and adult
                 patients. The system feeds for patient vitals and drugs administered are electronic, and in
                 this situation, if the dosage is not identified as pediatric or adult, it will be confusing on