Page 355 - Data Architecture


Chapter 9.1: Repetitive Analytics: Some Basics
processing, the “turnaround time” to do an analysis can be very important.

Sampling is especially important when doing heuristic analysis against big data because of
the sheer volume of data that has to be processed.

Fig. 9.1.11 shows the creation of an analytic sample.

Fig. 9.1.11 Creating the analytical sample.
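
As an illustration of how an analytic sample might be created, the following is a minimal sketch using reservoir sampling. This technique is an assumption chosen for illustration, not necessarily the method the figure depicts. It suits big data because the records are read once and never held in memory all at once. The function name reservoir_sample and its parameters are hypothetical.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Pick k records uniformly at random from an iterable of unknown length.

    Hypothetical helper: only k records are kept in memory at any time,
    and the source is scanned exactly once.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                sample[j] = record
    return sample
```

If the source holds fewer than k records, the function simply returns all of them.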

There are some downsides to sampling. One downside is that the analytic results obtained
when processing the sample may differ from the results achieved when processing the entire
database. For example, the sample may indicate that the average age of a customer is
35.78 years. When the full database is processed, it may be found that the average age of
a customer is really 36.21 years. In some cases, this small difference between results is
inconsequential. In other cases, the difference is truly significant. Whether the
difference is significant depends on how large it is and on how important accuracy is to
the analysis.
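
The gap between a sample result and a full result can be measured directly on data small enough to process both ways. The sketch below does so for an average age. The data is synthetic and made up purely for illustration, and the function names are hypothetical.

```python
import random
import statistics


def make_synthetic_ages(n, seed=0):
    """Generate n made-up customer ages between 18 and 80 (illustrative only)."""
    rng = random.Random(seed)
    return [rng.randint(18, 80) for _ in range(n)]


def compare_sample_to_full(ages, sample_size, seed=0):
    """Return (sample mean, full mean, absolute difference between them)."""
    rng = random.Random(seed)
    sample = rng.sample(ages, sample_size)  # simple random sample, no repeats
    sample_mean = statistics.mean(sample)
    full_mean = statistics.mean(ages)
    return sample_mean, full_mean, abs(sample_mean - full_mean)


if __name__ == "__main__":
    ages = make_synthetic_ages(100_000)
    sample_mean, full_mean, diff = compare_sample_to_full(ages, 1_000)
    print(f"sample: {sample_mean:.2f}  full: {full_mean:.2f}  diff: {diff:.2f}")
```

Whether the printed difference matters is the judgment described above: it depends on the size of the gap and on how much accuracy the analysis requires.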

If slight inaccuracies in the results are acceptable, then sampling works well.

If, in fact, there is a desire to get results that are as accurate as possible, then the
algorithmic development can be done against the sample data. When the analyst is satisfied
that the algorithm is producing proper results against the sample, the final run can be
made against the entire database, thereby satisfying both the need to do analysis quickly
and the need to achieve accurate results.
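
This two-stage approach can be sketched as follows, assuming the customer records fit in a Python list of dictionaries with an "age" field. That layout, along with the names average_age and develop_then_finalize, is hypothetical and chosen only for illustration.

```python
import random
import statistics


def average_age(customers):
    """The analysis under development: mean of each record's 'age' field."""
    return statistics.mean(c["age"] for c in customers)


def develop_then_finalize(customers, sample_size, analysis, seed=0):
    """Run the analysis on a sample first, then once on the full data."""
    rng = random.Random(seed)
    size = min(sample_size, len(customers))
    sample = rng.sample(customers, size)

    # Stage 1: fast turnaround while the algorithm is still being refined.
    trial_result = analysis(sample)

    # Stage 2: one final run against every record for the accurate answer.
    final_result = analysis(customers)
    return trial_result, final_result
```

In practice the analyst would repeat stage 1 many times while adjusting the analysis, and run stage 2 only once the results against the sample look right.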
