
Chapter 6.2: Introduction to Data Vault Modeling
Hash keys are important because they remove dependencies when loading the Data Vault 2.0 structures. A hash key can be computed value by value, and the "parent" key can likewise be computed, repeated for as many parent keys as there are values. There is no lookup dependency: no need to pre-cache, stage data in a temporary area, or do anything else to calculate each parent key during load processing.
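
As a sketch of this independence (the normalization rule and delimiter below are illustrative assumptions, not a DV2 prescription), a hub hash key and a satellite's parent hash key can be derived from the same business key with no lookup against the hub:

    import hashlib

    def hash_key(*business_key_parts: str) -> str:
        # Normalize (trim + uppercase) and join with a delimiter; any
        # consistent rule works as long as every loader applies it.
        normalized = ";".join(p.strip().upper() for p in business_key_parts)
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    # Hub row and satellite row derive the same key independently;
    # the satellite never has to look the hub row up first.
    assert hash_key("CUST-1001") == hash_key("CUST-1001")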

Big data system loads are nearly impossible to scale properly with sequence-numbering dependencies in place. Sequences (whether in DV1, dimensional models, or any other data model) force the parent structures to be loaded before the child structures. This "parent first, then look up the parent value" dependency turns the load cycle into a sequential row-by-row operation, inhibiting the scale-out that parallelism offers.
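
As a minimal contrast (hypothetical in-memory structures standing in for real tables), a sequence-driven load serializes on a shared counter and a lookup, while a hash-driven load does not:

    import hashlib

    # Sequence-driven: a child row cannot be written until its parent
    # has been loaded and the generated sequence number looked up.
    parent_seq: dict[str, int] = {}  # business key -> assigned sequence
    next_seq = 1

    def load_parent_with_sequence(business_key: str) -> int:
        global next_seq
        parent_seq[business_key] = next_seq  # parent MUST load first
        next_seq += 1
        return parent_seq[business_key]

    def load_child_with_sequence(business_key: str) -> int:
        return parent_seq[business_key]  # lookup dependency serializes the load

    # Hash-driven: parent and child derive the same key in any order,
    # on any node; no shared counter, no lookup.
    def load_child_with_hash(business_key: str) -> str:
        return hashlib.md5(business_key.strip().upper().encode()).hexdigest()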


This type of dependency not only slows the loading process but also kills any potential for parallelism, even with referential integrity shut off. Furthermore, it places a dependency into the loading stream in heterogeneous environments. For instance, when loading satellite data into Hadoop (perhaps as a JavaScript Object Notation (JSON) document), the loading stream requires a lookup of the sequence number from a hub that may live in a relational database. This dependency alone defeats the entire purpose of having a system like Hadoop in the first place.
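
A hedged illustration of that Hadoop case (the field names here are hypothetical): a satellite record written as a JSON document can carry a parent hash key computed in place, so the writer never queries the relational hub:

    import hashlib
    import json
    from datetime import datetime, timezone

    business_key = "CUST-1001"
    satellite_doc = {
        # Parent key computed locally from the business key; no call
        # out to the relational database that holds the hub.
        "hub_customer_hash_key": hashlib.md5(
            business_key.strip().upper().encode()).hexdigest(),
        "load_date": datetime.now(timezone.utc).isoformat(),
        "record_source": "CRM",
        "customer_name": "Acme Corp",
    }
    print(json.dumps(satellite_doc))  # lands in HDFS as-is, no lookup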


Hash keys do have their issues:

    • Length of the resulting computed value: storage for a hash key is greater than for a sequence number.
    • Possible collisions: the probability of a collision depends on the hashing function chosen (quantified in the sketch after this list).
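
The collision risk can be quantified with the standard birthday approximation p ≈ 1 - e^(-n^2 / (2 * 2^b)) for n keys and a b-bit hash. The sketch below evaluates it for a 128-bit hash such as MD5 (the key count is an arbitrary example):

    import math

    def collision_probability(n_keys: int, hash_bits: int) -> float:
        # Birthday-problem approximation for at least one collision.
        # expm1 keeps precision when the probability is tiny.
        return -math.expm1(-(n_keys ** 2) / (2.0 * 2.0 ** hash_bits))

    # Even at ten billion keys, a 128-bit hash space keeps the
    # probability of any collision around 1.5e-19.
    print(collision_probability(10_000_000_000, 128))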

The first issue leads to slower SQL joins and slower queries, because it takes longer to match or compare longer fields than it does to compare numerics. Hashes in Oracle and SQL Server are typically stored in fixed binary form (yes, this works as a primary key). Hashes in Hive and other Hadoop-based technologies, and in some other relational engines, must be stored as fixed-length character strings. For example, an MD5 hash result is BINARY(16), which becomes a fixed-length CHAR(32) hexadecimal-encoded string.
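
A small Python illustration of that storage difference: the same MD5 result as a 16-byte digest (what a BINARY(16) column stores) versus a 32-character hexadecimal string (what a CHAR(32) column stores):

    import hashlib

    h = hashlib.md5("CUST-1001".encode())
    raw = h.digest()       # 16 bytes  -> fits a BINARY(16) primary key
    hexed = h.hexdigest()  # 32 chars  -> fixed-length CHAR(32) column
    print(len(raw), len(hexed))  # 16 32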


The flip side of using a hash is its unlimited scalability in parallel loading. All data can be loaded in complete parallel, all the time, across multiple platforms, even platforms that are geographically split or split between on-premises and cloud deployments.
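
A sketch of that parallelism claim (the pool size and key list are arbitrary): because each hash key depends only on its own row, key generation can be mapped across worker processes with no ordering constraint and no shared sequence generator:

    import hashlib
    from multiprocessing import Pool

    def to_hash_key(business_key: str) -> str:
        return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

    if __name__ == "__main__":
        business_keys = [f"CUST-{i:06d}" for i in range(100_000)]
        with Pool(processes=4) as pool:
            # Each worker computes keys for its chunk independently.
            keys = pool.map(to_hash_key, business_keys, chunksize=5_000)
        print(len(keys), keys[0])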
Hash keys (or business keys) are part of the success of Data Vault 2.0 in a big data and NoSQL world. Hashing is optional in DV2. There are a variety of hashing algorithms available for use that include the