Page 233 - Data Architecture

Chapter 6.2: Introduction to Data Vault Modeling
           following:


               • MD5 (deprecated circa 2018)
               • SHA-0, SHA-1, SHA-2, and SHA-3 (SHA-1 deprecated circa 2018)
               • Perfect hashes


           The hash is computed from the business keys that arrive in the staging areas. All
           lookup dependencies are thereby removed, and the entire system can load in parallel
           across heterogeneous environments. The data set in the model can then be spread
           across MPP environments by selecting the hash value as the distribution key. Because
           hash values are effectively random, using the hash key as the MPP bucket distribution
           key yields a mostly even distribution across the MPP nodes.
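           The idea can be sketched in a few lines of Python. This is a minimal illustration,
           not the book's reference implementation: the normalization choices (trimming,
           upper-casing, the "||" delimiter) and the hypothetical `hash_key` name are
           assumptions for the example, and SHA-256 is used per the circa-2018 convention
           mentioned below.

```python
import hashlib

def hash_key(*business_keys, delimiter="||"):
    """Build a deterministic hash key from one or more business keys.

    Keys are trimmed, upper-cased, and joined with a delimiter before
    hashing, so the same business key always yields the same hash key
    regardless of source-system formatting.
    """
    normalized = delimiter.join(str(k).strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same business key hashes identically in any environment, so
# parallel loaders need no central lookup or sequence generator.
hk = hash_key("  cust-1001 ")
assert hk == hash_key("CUST-1001")

# A simple MPP-style bucket assignment: derive the node from the hash.
num_nodes = 8
bucket = int(hk, 16) % num_nodes
```

           Because the key is a pure function of the business key, any node in any
           environment can compute it independently, which is what makes the fully
           parallel loads described above possible.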


           “When testing a hash function, the uniformity of the distribution of hash values can be
           evaluated by the chi-squared test.” https://en.wikipedia.org/wiki/Hash_function
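           As a rough illustration of that test, the following sketch (names and bucket
           count are illustrative assumptions, not from the source) buckets hash values
           and computes the chi-squared statistic against a uniform expectation:

```python
import hashlib

def chi_squared_uniformity(keys, num_buckets=64):
    """Chi-squared statistic for how evenly hash values spread over buckets.

    For a uniform hash the statistic should land near the degrees of
    freedom (num_buckets - 1); a much larger value signals skew.
    """
    counts = [0] * num_buckets
    for k in keys:
        h = int(hashlib.sha256(k.encode("utf-8")).hexdigest(), 16)
        counts[h % num_buckets] += 1
    expected = len(keys) / num_buckets
    return sum((c - expected) ** 2 / expected for c in counts)

keys = [f"CUST-{i}" for i in range(100_000)]
stat = chi_squared_uniformity(keys)
# With 64 buckets (63 degrees of freedom), a well-distributed hash
# should produce a statistic in the general vicinity of 63.
```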


           Fortunately, these hash functions are already designed, and their designers have
           taken this distribution mathematics into account. The hashing function (if hashing
           is to be utilized) is at the discretion of the design team; as of circa 2018, many
           teams have chosen SHA-256.


           One point worth noting: the longer the hash output (in bits), the less probable a
           collision becomes. This is something to take into consideration, especially when
           the data sets are large (e.g., big data, one billion records on input per load cycle
           per table).
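           The effect of output length can be quantified with the standard birthday-problem
           approximation, p ≈ 1 − e^(−n² / 2^(b+1)) for n records and a b-bit hash. The sketch
           below (function name is illustrative) compares a 128-bit and a 256-bit hash at the
           one-billion-record scale mentioned above:

```python
import math

def collision_probability(n_records, hash_bits):
    """Birthday-problem approximation of a collision among n records.

    Uses expm1 because the exponent is so close to zero that
    1 - exp(x) would round to 0.0 in floating point.
    """
    exponent = -(n_records ** 2) / (2 ** (hash_bits + 1))
    return -math.expm1(exponent)

n = 1_000_000_000                          # one billion records per load cycle
p_128 = collision_probability(n, 128)      # on the order of 1e-21
p_256 = collision_probability(n, 256)      # smaller still by dozens of orders
```

           Even at 128 bits the probability is tiny, but it is nonzero, which is why a
           collision strategy (discussed next) is still required.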


           If a hash key is chosen for implementation, then a hash collision strategy must also
           be designed; this is the responsibility of the team. Several options are available for
           addressing hash collisions. One of the recommended strategies is the reverse hash.
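           One common reading of the reverse-hash strategy is to store, alongside the primary
           hash, a second hash computed over the reversed business key: two distinct keys that
           collide on one hash are vanishingly unlikely to also collide on the other. The sketch
           below is an assumption-laden illustration of that idea (the `hash_pair` name and the
           string-reversal detail are the author's interpretation, not the source's specification):

```python
import hashlib

def hash_pair(business_key):
    """Return (forward hash, reverse hash) for collision detection.

    A true duplicate business key matches on both hashes; a primary-hash
    collision between different keys would match on only one, so
    comparing both detects and disambiguates the collision.
    """
    forward = hashlib.sha256(business_key.encode("utf-8")).hexdigest()
    reverse = hashlib.sha256(business_key[::-1].encode("utf-8")).hexdigest()
    return forward, reverse

f1, r1 = hash_pair("CUST-1001")
f2, r2 = hash_pair("CUST-1002")
# Same key only when both hashes agree.
is_same_key = (f1, r1) == (f2, r2)
```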


           This applies only to the Data Vault 2.0 model that acts as the enterprise warehouse.
           It is still possible (and even advisable) to leverage sequence numbers in persisted
           information marts (data marts) downstream, to achieve the fastest possible joins
           within a homogeneous environment.


           The largest benefit isn’t from the modeling side of the house; it’s from the loading
           and querying perspectives. For loading, hashing removes the lookup dependencies and
           allows loads to Hadoop and other NoSQL environments in parallel with loads to RDBMS
           systems. For querying, it allows “late-join” or run-time binding of data across Java
           Database Connectivity (JDBC) and Open Database Connectivity (ODBC) between Hadoop,
           NoSQL,