Page 233 - Data Architecture

Chapter 6.2: Introduction to Data Vault Modeling
           following:


               • MD5 (deprecated circa 2018)
               • SHA-0, SHA-1, SHA-2, and SHA-3 (SHA-1 deprecated circa 2018)
               • Perfect hashes


           The hash is computed from the business keys that arrive in the staging areas. All
           lookup dependencies are thereby removed, and the entire system can load in parallel
           across heterogeneous environments. The data set in the model can then be spread
           across MPP environments by selecting the hash value as the distribution key. Because
           hash values are effectively random, using the hash key as the MPP bucket distribution
           key yields a mostly even distribution across the MPP nodes.
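           The idea can be sketched in a few lines of Python. This is a minimal illustration,
           not the book's reference implementation: the normalization choices (trimming,
           upper-casing, the "||" delimiter) and the hypothetical `hash_key` name are
           assumptions for the example, and SHA-256 is used per the circa-2018 convention
           mentioned below.

```python
import hashlib

def hash_key(*business_keys, delimiter="||"):
    """Build a deterministic hash key from one or more business keys.

    Keys are trimmed, upper-cased, and joined with a delimiter before
    hashing, so the same business key always yields the same hash key
    regardless of source-system formatting.
    """
    normalized = delimiter.join(str(k).strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same business key hashes identically in any environment, so
# parallel loaders need no central lookup or sequence generator.
hk = hash_key("  cust-1001 ")
assert hk == hash_key("CUST-1001")

# A simple MPP-style bucket assignment: derive the node from the hash.
num_nodes = 8
bucket = int(hk, 16) % num_nodes
```

           Because the key is a pure function of the business key, any node in any
           environment can compute it independently, which is what makes the fully
           parallel loads described above possible.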


           “When testing a hash function, the uniformity of the distribution of hash values can be
           evaluated by the chi-squared test.” https://en.wikipedia.org/wiki/Hash_function
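           As a rough illustration of that test, the following sketch (names and bucket
           count are illustrative assumptions, not from the source) buckets hash values
           and computes the chi-squared statistic against a uniform expectation:

```python
import hashlib

def chi_squared_uniformity(keys, num_buckets=64):
    """Chi-squared statistic for how evenly hash values spread over buckets.

    For a uniform hash the statistic should land near the degrees of
    freedom (num_buckets - 1); a much larger value signals skew.
    """
    counts = [0] * num_buckets
    for k in keys:
        h = int(hashlib.sha256(k.encode("utf-8")).hexdigest(), 16)
        counts[h % num_buckets] += 1
    expected = len(keys) / num_buckets
    return sum((c - expected) ** 2 / expected for c in counts)

keys = [f"CUST-{i}" for i in range(100_000)]
stat = chi_squared_uniformity(keys)
# With 64 buckets (63 degrees of freedom), a well-distributed hash
# should produce a statistic in the general vicinity of 63.
```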


           Fortunately, these hash functions are already designed, and their designers have
           taken this distribution mathematics into account. The hashing function (if hashing
           is to be utilized) is at the discretion of the design team; as of circa 2018, many
           teams have chosen SHA-256.


           One point worth noting: the longer the hash output (in bits), the less probable a
           collision becomes. This is something to take into consideration, especially when
           the data sets are large (e.g., big data, one billion records on input per load cycle
           per table).
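           The effect of output length can be quantified with the standard birthday-problem
           approximation, p ≈ 1 − e^(−n² / 2^(b+1)) for n records and a b-bit hash. The sketch
           below (function name is illustrative) compares a 128-bit and a 256-bit hash at the
           one-billion-record scale mentioned above:

```python
import math

def collision_probability(n_records, hash_bits):
    """Birthday-problem approximation of a collision among n records.

    Uses expm1 because the exponent is so close to zero that
    1 - exp(x) would round to 0.0 in floating point.
    """
    exponent = -(n_records ** 2) / (2 ** (hash_bits + 1))
    return -math.expm1(exponent)

n = 1_000_000_000                          # one billion records per load cycle
p_128 = collision_probability(n, 128)      # on the order of 1e-21
p_256 = collision_probability(n, 256)      # smaller still by dozens of orders
```

           Even at 128 bits the probability is tiny, but it is nonzero, which is why a
           collision strategy (discussed next) is still required.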


           If a hash key is chosen for implementation, then a hash collision strategy must also
           be designed; this is the responsibility of the team. Several options are available for
           addressing hash collisions. One of the recommended strategies is the reverse hash.
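           One common reading of the reverse-hash strategy is to store, alongside the primary
           hash, a second hash computed over the reversed business key: two distinct keys that
           collide on one hash are vanishingly unlikely to also collide on the other. The sketch
           below is an assumption-laden illustration of that idea (the `hash_pair` name and the
           string-reversal detail are the author's interpretation, not the source's specification):

```python
import hashlib

def hash_pair(business_key):
    """Return (forward hash, reverse hash) for collision detection.

    A true duplicate business key matches on both hashes; a primary-hash
    collision between different keys would match on only one, so
    comparing both detects and disambiguates the collision.
    """
    forward = hashlib.sha256(business_key.encode("utf-8")).hexdigest()
    reverse = hashlib.sha256(business_key[::-1].encode("utf-8")).hexdigest()
    return forward, reverse

f1, r1 = hash_pair("CUST-1001")
f2, r2 = hash_pair("CUST-1002")
# Same key only when both hashes agree.
is_same_key = (f1, r1) == (f2, r2)
```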


           This applies only to the Data Vault 2.0 model that acts as the enterprise warehouse.
           It is still possible (and even advisable) to leverage sequence numbers in persisted
           information marts (data marts) downstream, to achieve the fastest possible joins
           within a homogeneous environment.


           The largest benefit isn’t from the modeling side of the house; it’s from the loading
           and querying perspectives. For loading, hashing removes the lookup dependencies and
           allows loads to Hadoop and other NoSQL environments in parallel with loads to RDBMS
           systems. For querying, it allows “late-join” or run-time binding of data across Java
           Database Connectivity (JDBC) and Open Database Connectivity (ODBC) between Hadoop,
           NoSQL,