Page 355 - Data Architecture

Chapter 9.1: Repetitive Analytics: Some Basics
           processing, the “turnaround time” to do an analysis can be very important.


           Sampling is especially important when doing heuristic analysis against big data because of
           the sheer volume of data that has to be processed.


           Fig. 9.1.11 shows the creation of an analytic sample.

               Fig. 9.1.11 Creating the analytical sample.

           There are some downsides to sampling. One downside is that the analytic results obtained
           when processing the sample may differ from the results achieved when processing
           against the entire database. For example, the sample may produce the result that the
           average age of a customer is 35.78 years. When the full database is processed, it may be
           found that the average age of a customer is really 36.21 years. In some cases, this small
           difference between results is inconsequential. In other cases, the difference is truly
           significant. Whether it is significant depends on how large the difference is and how
           important accuracy is to the analysis.
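The gap between a sample result and a full-database result can be illustrated with a small sketch. Everything here is an assumption made for illustration: the ages are synthetic, the 1% sampling rate is arbitrary, and the names `full_ages` and `sample_ages` are hypothetical. The standard error is printed as a rough guide to how far the sample mean can be expected to drift from the full mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical "full database": synthetic customer ages, clamped to 18..90.
full_ages = [min(90.0, max(18.0, random.gauss(36.0, 12.0)))
             for _ in range(500_000)]

# Analytic sample: 1% of the records, drawn at random without replacement.
sample_ages = random.sample(full_ages, k=5_000)

full_mean = statistics.fmean(full_ages)
sample_mean = statistics.fmean(sample_ages)

# Standard error of the sample mean: sample std dev / sqrt(sample size).
std_error = statistics.stdev(sample_ages) / len(sample_ages) ** 0.5

print(f"full mean:      {full_mean:.2f}")
print(f"sample mean:    {sample_mean:.2f}")
print(f"difference:     {abs(full_mean - sample_mean):.2f}")
print(f"standard error: {std_error:.2f}")
```

Whether a difference of this size matters is a judgment about the analysis, not something the code can decide.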


           If slight inaccuracies in the results are acceptable, then sampling works well.


           If, in fact, the results must be as accurate as possible, then the algorithm can be
           developed against the sample data. When the analyst is satisfied that the analysis is
           working properly against the sample, the final run can be made against the entire
           database, thereby satisfying both the need to do analysis quickly and the need to
           achieve accurate results.
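The develop-on-the-sample, finish-on-the-full-database workflow might look like the sketch below. It is a minimal illustration under stated assumptions, not a prescribed method: `average_age` stands in for whatever analysis is being developed, and `full_db` is a hypothetical in-memory list of synthetic records rather than a real database.

```python
import random
import time

def average_age(records):
    """Analysis under development: mean age over records that have an age."""
    ages = [r["age"] for r in records if r["age"] is not None]
    return sum(ages) / len(ages)

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical full database of synthetic customer records.
full_db = [{"id": i, "age": random.randint(18, 90)} for i in range(300_000)]

# Analytic sample (1% of the records) used while the algorithm is refined.
sample = random.sample(full_db, k=3_000)

# Development runs: fast turnaround against the sample.
start = time.perf_counter()
sample_result = average_age(sample)
sample_secs = time.perf_counter() - start

# Final run: the same, now trusted, function against the entire database.
start = time.perf_counter()
final_result = average_age(full_db)
full_secs = time.perf_counter() - start

print(f"sample run: {sample_result:.2f} years in {sample_secs:.4f} s")
print(f"final run:  {final_result:.2f} years in {full_secs:.4f} s")
```

Because the same function runs in both phases, only the input changes between the development runs and the final run.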



