Page 203 - Building Big Data Applications
Chapter 11 Data discovery and connectivity 203
FIGURE 11.6 The new infrastructure divide.
their data sets must coexist in equilibrium, and the data within them must be queried and
analyzed in a coordinated fashion. This problem is a challenge for any data architect, and it is
very hard to do when companies have separate tools for each broad class of data repository.
True harmonized orchestration does not come from tools that merely work with one
repository class or the other. The winners are tools that work
across both and can bring them together. This is true for both query and analysis: tools
that can fetch data from both repository types, join that data together, and then present
visualized insights from that integration definitely qualify.
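To make the fetch, join, and present sequence concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular tool: an in-memory SQLite table stands in for the warehouse, an inlined CSV string stands in for a raw file in the lake, and the table, column, and function names (`customers`, `customer_id`, `clicks_by_region`) are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical "warehouse": an in-memory SQLite table of curated customer records.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
)

# Hypothetical "lake": a raw CSV file of click events, inlined here as a string.
LAKE_FILE = io.StringIO(
    "customer_id,clicks\n"
    "1,10\n"
    "2,4\n"
    "1,6\n"
    "3,5\n"
)


def clicks_by_region(conn, lake_file):
    """Join lake events to warehouse customers and total the clicks per region."""
    # Fetch from the warehouse side: customer_id -> region.
    region_of = dict(conn.execute("SELECT customer_id, region FROM customers"))

    # Fetch from the lake side and join each event on customer_id.
    totals = {}
    for row in csv.DictReader(lake_file):
        region = region_of.get(int(row["customer_id"]))
        if region is None:
            continue  # event with no matching warehouse record
        totals[region] = totals.get(region, 0) + int(row["clicks"])
    return totals


result = clicks_by_region(warehouse, LAKE_FILE)

# "Present" step: a plain-text summary in place of a real visualization.
for region, clicks in sorted(result.items()):
    print(f"{region}: {clicks}")
```

The join happens in application code only to keep the sketch self-contained. A real cross-repository tool would more likely push the join down to a query engine that can read both stores.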
The difference between lakes and swamps is much like the distinction between well-
organized and disorganized hard disks. Similarly, having a well-organized data catalog,
with lots of metadata applied to the data set files within it, makes those data sets more
discoverable and the data lake, as a whole, more usable. Beyond just having a good
handle on organization, for a catalog to be its best, data from distinct data sources must
really coalesce, or the integration at query time will be underserved. Does your data
catalog see data in both your data warehouse and data lake? If the answer is yes, can the
catalog tell you which data sets, from each, can or should be used together? In other
words, does your governance tool chronicle the relationships within and flows between
such heterogeneous data sets? Does it perform foundational data discovery?
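The questions above can be read as requirements on a catalog's data model. The following Python sketch shows one possible shape for such a model. Everything in it is an assumption made for illustration: the `DataSet` and `Catalog` classes, the `repository` labels, and the example data set names are hypothetical and do not correspond to any specific catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    name: str
    repository: str                      # assumed labels: "warehouse" or "lake"
    tags: set = field(default_factory=set)


@dataclass
class Catalog:
    datasets: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)  # (left, right, join_key)

    def register(self, dataset):
        """Add a data set and its metadata to the catalog."""
        self.datasets[dataset.name] = dataset

    def relate(self, left, right, join_key):
        """Document that two registered data sets can be joined on join_key."""
        for name in (left, right):
            if name not in self.datasets:
                raise KeyError(f"unknown data set: {name}")
        self.relationships.append((left, right, join_key))

    def search(self, tag):
        """Discover data sets by metadata tag, regardless of repository."""
        return sorted(n for n, d in self.datasets.items() if tag in d.tags)

    def cross_repository_partners(self, name):
        """List data sets in the other repository class that join with name."""
        home = self.datasets[name].repository
        partners = []
        for left, right, key in self.relationships:
            if name in (left, right):
                other = right if left == name else left
                if self.datasets[other].repository != home:
                    partners.append((other, key))
        return partners


catalog = Catalog()
catalog.register(DataSet("dw.customers", "warehouse", {"customer", "curated"}))
catalog.register(DataSet("lake/clickstream.csv", "lake", {"customer", "raw"}))
catalog.register(DataSet("dw.orders", "warehouse", {"sales"}))
catalog.relate("dw.customers", "lake/clickstream.csv", "customer_id")
catalog.relate("dw.customers", "dw.orders", "customer_id")

print(catalog.search("customer"))
print(catalog.cross_repository_partners("dw.customers"))
```

In this sketch, `search` answers whether the catalog sees data in both repositories, and `cross_repository_partners` answers which data sets from each can be used together.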
If the answer to all these questions is yes, you are in a good spot, but you are not done
yet. Even if relationships and flows can be documented in the catalog, you then
need to determine whether this work must be done manually or whether the relationships and flows
can instead be detected automatically. And even if automatic detection is
supported, you need to determine whether it will work when there is no schema
information that documents the relationships or flows.
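One way detection can work without schema information is to compare the values themselves. The sketch below is a simple heuristic chosen for illustration, not the method of any particular product: it proposes a relationship when nearly all distinct values in one column also appear in a unique-valued column of another table. The function names, the `threshold` parameter, and the sample tables are all hypothetical.

```python
def containment(child_values, parent_values):
    """Fraction of distinct child values that also occur among parent values."""
    child = set(child_values)
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)


def detect_relationships(tables, threshold=0.9):
    """Propose (child_table, child_col, parent_table, parent_col, score) tuples.

    tables maps a table name to {column_name: list_of_values}. No declared
    keys or types are consulted; only the data values are compared.
    """
    candidates = []
    for child_name, child_cols in tables.items():
        for parent_name, parent_cols in tables.items():
            if child_name == parent_name:
                continue
            for c_col, c_vals in child_cols.items():
                for p_col, p_vals in parent_cols.items():
                    # Treat a column as a possible key only if its values are unique.
                    if len(set(p_vals)) != len(p_vals):
                        continue
                    score = containment(c_vals, p_vals)
                    if score >= threshold:
                        candidates.append(
                            (child_name, c_col, parent_name, p_col, score)
                        )
    return candidates


# Hypothetical data with uninformative column names, as in a schema-less lake file.
tables = {
    "customers": {
        "col_a": [1, 2, 3, 4],
        "col_b": ["EMEA", "APAC", "EMEA", "AMER"],
    },
    "events": {
        "f0": [1, 1, 2, 4, 4],
        "f1": [10, 6, 4, 5, 7],
    },
}

found = detect_relationships(tables)
for candidate in found:
    print(candidate)
```

Value overlap alone can yield false positives, for example between two unrelated columns of small integers, so candidates produced this way are best treated as suggestions for review rather than confirmed relationships.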