Page 203 - Building Big Data Applications

Chapter 11   Data discovery and connectivity  203

                                         FIGURE 11.6 The new infrastructure divide.
                 their data sets must coexist in equilibrium, and the data within them must be queried and
                 analyzed in a coordinated fashion. This problem is a challenge for any data architect, and it is
                 very hard to do when companies have separate tools for each broad class of data repository.
                   Harmonized orchestration is not fully realized by tools that can work with either
                 repository class alone. The winners are tools that work across both and bring them
                 together. This is true for both query and analysis: tools that can fetch data from both
                 repository types, join that data together, and then present visualized insights from
                 that integration definitely qualify.
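                   As a minimal sketch of what such a cross-repository query might look like, the
                 Python example below joins data from two stand-ins: an in-memory SQLite table playing
                 the role of the data warehouse, and a small CSV text playing the role of a data lake
                 file. The table, column, and function names (customers, customer_id, clicks,
                 clicks_by_region) are hypothetical and chosen only for illustration; a real tool would
                 connect to actual warehouse and lake engines.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a data warehouse: an in-memory SQLite table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "East"), (2, "West"), (3, "East")],
)

# Hypothetical stand-in for a data lake file: raw CSV click events.
LAKE_FILE = io.StringIO(
    "customer_id,clicks\n"
    "1,10\n"
    "2,4\n"
    "1,6\n"
    "3,5\n"
)


def clicks_by_region(conn, lake_file):
    """Join lake events to warehouse customers and total clicks per region."""
    # Fetch the structured, curated side from the warehouse.
    region_of = dict(conn.execute("SELECT customer_id, region FROM customers"))
    totals = {}
    # Stream the raw side from the lake and join each event on customer_id.
    for row in csv.DictReader(lake_file):
        region = region_of.get(int(row["customer_id"]))
        if region is None:
            continue  # lake event with no matching warehouse record
        totals[region] = totals.get(region, 0) + int(row["clicks"])
    return totals


result = clicks_by_region(warehouse, LAKE_FILE)
print(result)
```

                   The join happens at query time, in one place, so the integrated result can be handed
                 straight to a visualization layer instead of being assembled by hand from two
                 separate tools.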
                   The difference between lakes and swamps is much like the distinction between well-
                 organized and disorganized hard disks. Similarly, having a well-organized data catalog,
                 with lots of metadata applied to the data set files within it, makes those data sets more
                 discoverable and the data lake, as a whole, more usable. Beyond just having a good
                 handle on organization, for a catalog to be its best, data from distinct data sources must
                 really coalesce, or the integration at query time will be underserved. Does your data
                 catalog see data in both your data warehouse and data lake? If the answer is yes, can the
                 catalog tell you which data sets, from each, can or should be used together? In other
                 words, does your governance tool chronicle the relationships within and flows between
                 such heterogeneous data sets? Does it perform foundational data discovery?
                   If the answer to all these questions is yes, you are in a good spot, but you are not done
                 yet. Even if relationships and flows can be documented in the catalog, you then need to
                 determine whether this work must be done manually or whether the relationships and flows
                 can instead be detected automatically. And even if automatic detection is supported, you
                 need to determine whether it will work when there is no schema information that documents
                 the relationships or flows.
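                   One way automated detection can work without schema information is to compare the
                 values themselves. The Python sketch below is an assumption about how such a check
                 might be built, not a description of any particular catalog product: it flags a column
                 in one data set as a likely reference to a column in another when nearly all of its
                 distinct values also appear there. The function names, the sample data sets, and the
                 0.9 threshold are hypothetical.

```python
def column_overlap(left_values, right_values):
    """Fraction of distinct left values that also occur among the right values."""
    left, right = set(left_values), set(right_values)
    if not left:
        return 0.0
    return len(left & right) / len(left)


def detect_relationships(columns, threshold=0.9):
    """Suggest (source, target, score) links between columns of different data sets.

    `columns` maps (dataset_name, column_name) to a list of sampled values.
    No declared keys or schema are consulted; only value overlap is used.
    """
    candidates = []
    for src, src_values in columns.items():
        for dst, dst_values in columns.items():
            if src[0] == dst[0]:
                continue  # only look for links across data sets
            score = column_overlap(src_values, dst_values)
            if score >= threshold:
                candidates.append((src, dst, score))
    return sorted(candidates)


# Hypothetical samples: one warehouse table and one schema-less lake file.
sampled_columns = {
    ("warehouse_customers", "customer_id"): [1, 2, 3, 4],
    ("warehouse_customers", "region"): ["East", "West", "East", "North"],
    ("lake_clicks", "cust"): [1, 2, 2, 3],
    ("lake_clicks", "clicks"): [10, 4, 6, 5],
}

links = detect_relationships(sampled_columns)
for source, target, score in links:
    print(source, "->", target, round(score, 2))
```

                   With this sample, the only suggested link runs from the lake column cust to the
                 warehouse column customer_id, even though the names differ and no key was declared.
                 A heuristic like this produces candidates for review rather than confirmed
                 relationships, since unrelated columns can share values by coincidence.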