Given the data volumes in today's data lakes, both the number of data sets and the
size of each, discovering such relationships manually is so difficult as to be thoroughly
impractical. Automation is your savior: such relationships and flows can be detected
algorithmically through analysis of data values, distributions, formulas, and so forth.
But how do we get to this level of sophistication? Who can guide us on this journey?
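To make this concrete, below is a minimal sketch of one such detection technique: profiling the overlap of distinct values between columns of different data sets, where a high Jaccard similarity suggests a candidate join key or foreign-key relationship. The file names, CSV format, and similarity threshold are illustrative assumptions, not a prescription from the text.

```python
# A minimal sketch of value-overlap profiling between dataset columns.
# The CSV paths, column handling, and threshold are illustrative
# assumptions, not the method of any specific tool.
import csv
from itertools import combinations

def load_columns(path):
    """Read a CSV file and return {column_name: set of distinct values}."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        columns = {name: set() for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                if value:  # ignore empty cells
                    columns[name].add(value)
    return columns

def jaccard(a, b):
    """Jaccard similarity of two value sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_relationships(paths, threshold=0.8):
    """Compare every column pair across files; report high-overlap
    pairs as candidate join keys."""
    profiles = {p: load_columns(p) for p in paths}
    candidates = []
    for (p1, cols1), (p2, cols2) in combinations(profiles.items(), 2):
        for c1, v1 in cols1.items():
            for c2, v2 in cols2.items():
                score = jaccard(v1, v2)
                if score >= threshold:
                    candidates.append((p1, c1, p2, c2, score))
    return candidates

# Hypothetical usage:
# for rel in candidate_relationships(["orders.csv", "customers.csv"]):
#     print(rel)
```

A real tool would add statistical tests on value distributions and sampling for very large files, but the core idea, comparing what the data actually contains rather than what it is named, is the same.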
This is an issue for data management tools, too. If you are going to maintain a
comprehensive data catalog, then all the data across the enterprise must be in it. The
catalog must be well categorized, tagged, and managed with appropriate metadata,
and the exercise should eventually help the enterprise break down the silos and integrate
the data into one repository with all the interfaces and access mechanisms established.
In establishing and managing the catalogs, if data sets from the data lake are not
properly cataloged, the lake will quickly become mismanaged and lead to even further
frustration among users. This is especially true because of the physical format of a
data lake: a collection of files in a folder structure. Similarly, if data sets across the
databases are not cataloged in the exercise, they will be left hanging loose and will
create a mess when the catalog is put to use. We need a tool to ensure that all this
happens in a governed manner and results in a compliant outcome (Fig. 11.7).
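Because a data lake is physically just files in folders, even a crude automated inventory is better than none. The sketch below, with an assumed root path and a naive folder-name tagging rule, walks the tree and emits one catalog entry per file; a real catalog tool would enrich these entries with curated business metadata and lineage.

```python
# A minimal sketch of inventorying a data lake's folder structure.
# The root path, tagging rule, and output format are assumptions
# made for illustration only.
import json
import os
import time

def build_catalog(root):
    """Walk a directory tree and return one catalog entry per file."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Crude tags derived from the folder hierarchy; a real catalog
        # would attach curated business metadata here as well.
        tags = [] if rel == "." else rel.split(os.sep)
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            entries.append({
                "path": path,
                "format": os.path.splitext(name)[1].lstrip(".") or "unknown",
                "size_bytes": stat.st_size,
                "modified_utc": time.strftime("%Y-%m-%d",
                                              time.gmtime(stat.st_mtime)),
                "tags": tags,
            })
    return entries

# Hypothetical usage: emit the catalog as JSON for downstream tooling.
# print(json.dumps(build_catalog("/data/lake"), indent=2))
```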
The reason for seeking a tool that can be “smart” is primarily to account for external
data that comes at us from the internet, aligning it with internal corporate data to
create meaningful insights while keeping all rules of compliance intact. These rules of
compliance include GDPR, CCPA, financial regulations such as Basel III, Safe Harbor,
and more.
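As one small illustration of what keeping the rules of compliance intact can mean in practice, a “smart” catalog tool might flag columns whose names suggest personal data so that GDPR or CCPA handling rules can be applied before external and internal data are combined. The pattern list below is a deliberately crude, hypothetical heuristic, not the method of any particular tool.

```python
# A minimal sketch of compliance-oriented tagging: scan column names
# for patterns that commonly indicate personal data under GDPR/CCPA.
# The pattern list and column names are illustrative assumptions.
import re

# Hypothetical indicators of personally identifiable information (PII).
PII_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in ("email", "phone", "ssn", "birth", "name", "address")
]

def flag_pii_columns(column_names):
    """Return the subset of column names matching a PII indicator."""
    return [
        col for col in column_names
        if any(p.search(col) for p in PII_PATTERNS)
    ]

# Hypothetical usage:
# flag_pii_columns(["customer_name", "email_addr", "order_total"])
# -> ["customer_name", "email_addr"]
```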
FIGURE 11.7 Data from external sources.