
Chapter 6.2: Introduction to Data Vault Modeling
Hash keys are important because they remove dependencies when loading the Data Vault 2.0 structures. A hash key can be computed value by value, and the "parent" key can likewise be computed, repeated for as many parent keys as there are values. There is no lookup dependency: no need to pre-cache, stage data in a temporary area, or do anything else to calculate each parent key during load processing.
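
As a sketch of this independence (the normalization rule and delimiter below are illustrative assumptions, not a DV2 prescription), a hub hash key and a satellite's parent hash key can be derived from the same business key with no lookup against the hub:

    import hashlib

    def hash_key(*business_key_parts: str) -> str:
        # Normalize (trim + uppercase) and join with a delimiter; any
        # consistent rule works as long as every loader applies it.
        normalized = ";".join(p.strip().upper() for p in business_key_parts)
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    # Hub row and satellite row derive the same key independently;
    # the satellite never has to look the hub row up first.
    assert hash_key("CUST-1001") == hash_key("CUST-1001")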

Big data system loads are nearly impossible to scale properly with sequence-numbering dependencies in place. Sequences (whether in DV1, dimensional models, or any other data model) force the parent structures to be loaded before the child structures. This "parent first, then look up the parent value" dependency turns the load cycle into a sequential row-by-row operation, inhibiting the scale-out that parallelism offers.
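
As a minimal contrast (hypothetical in-memory structures standing in for real tables), a sequence-driven load serializes on a shared counter and a lookup, while a hash-driven load does not:

    import hashlib

    # Sequence-driven: a child row cannot be written until its parent
    # has been loaded and the generated sequence number looked up.
    parent_seq: dict[str, int] = {}  # business key -> assigned sequence
    next_seq = 1

    def load_parent_with_sequence(business_key: str) -> int:
        global next_seq
        parent_seq[business_key] = next_seq  # parent MUST load first
        next_seq += 1
        return parent_seq[business_key]

    def load_child_with_sequence(business_key: str) -> int:
        return parent_seq[business_key]  # lookup dependency serializes the load

    # Hash-driven: parent and child derive the same key in any order,
    # on any node; no shared counter, no lookup.
    def load_child_with_hash(business_key: str) -> str:
        return hashlib.md5(business_key.strip().upper().encode()).hexdigest()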


This type of dependency not only slows the loading process but also kills any potential for parallelism, even with referential integrity shut off. Furthermore, it places a dependency into the loading stream in heterogeneous environments. For instance, when loading satellite data into Hadoop (perhaps as a JavaScript Object Notation (JSON) document), the loading stream requires a lookup of the sequence number from a hub that may live in a relational database. This dependency alone defeats the entire purpose of having a system like Hadoop in the first place.
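
A hedged illustration of that Hadoop case (the field names here are hypothetical): a satellite record written as a JSON document can carry a parent hash key computed in place, so the writer never queries the relational hub:

    import hashlib
    import json
    from datetime import datetime, timezone

    business_key = "CUST-1001"
    satellite_doc = {
        # Parent key computed locally from the business key; no call
        # out to the relational database that holds the hub.
        "hub_customer_hash_key": hashlib.md5(
            business_key.strip().upper().encode()).hexdigest(),
        "load_date": datetime.now(timezone.utc).isoformat(),
        "record_source": "CRM",
        "customer_name": "Acme Corp",
    }
    print(json.dumps(satellite_doc))  # lands in HDFS as-is, no lookup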


Hash keys do have their issues:

    • Length of the resulting computed value: storage for a hash key is greater than for a sequence number.
    • Possible collisions: the probability of a collision depends on the hashing function chosen (quantified in the sketch after this list).
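
The collision risk can be quantified with the standard birthday approximation p ≈ 1 - e^(-n^2 / (2 * 2^b)) for n keys and a b-bit hash. The sketch below evaluates it for a 128-bit hash such as MD5 (the key count is an arbitrary example):

    import math

    def collision_probability(n_keys: int, hash_bits: int) -> float:
        # Birthday-problem approximation for at least one collision.
        # expm1 keeps precision when the probability is tiny.
        return -math.expm1(-(n_keys ** 2) / (2.0 * 2.0 ** hash_bits))

    # Even at ten billion keys, a 128-bit hash space keeps the
    # probability of any collision around 1.5e-19.
    print(collision_probability(10_000_000_000, 128))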

The first issue leads to slower SQL joins and slower queries, because it takes longer to match or compare longer fields than it does to compare numerics. Hashes in Oracle and SQL Server are typically stored in fixed binary form (yes, this works as a primary key). Hashes in Hive and other Hadoop-based technologies, and in some other relational engines, must be stored as fixed-length character strings. For example, an MD5 hash result is BINARY(16), which becomes a fixed-length CHAR(32) hexadecimal-encoded string.
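
A small Python illustration of that storage difference: the same MD5 result as a 16-byte digest (what a BINARY(16) column stores) versus a 32-character hexadecimal string (what a CHAR(32) column stores):

    import hashlib

    h = hashlib.md5("CUST-1001".encode())
    raw = h.digest()       # 16 bytes  -> fits a BINARY(16) primary key
    hexed = h.hexdigest()  # 32 chars  -> fixed-length CHAR(32) column
    print(len(raw), len(hexed))  # 16 32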


The flip side of using a hash is its unlimited scalability in parallel loading. All data can be loaded in complete parallel, all the time, across multiple platforms, even platforms that are geographically split or split between on-premises and cloud deployments.
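
A sketch of that parallelism claim (the pool size and key list are arbitrary): because each hash key depends only on its own row, key generation can be mapped across worker processes with no ordering constraint and no shared sequence generator:

    import hashlib
    from multiprocessing import Pool

    def to_hash_key(business_key: str) -> str:
        return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

    if __name__ == "__main__":
        business_keys = [f"CUST-{i:06d}" for i in range(100_000)]
        with Pool(processes=4) as pool:
            # Each worker computes keys for its chunk independently.
            keys = pool.map(to_hash_key, business_keys, chunksize=5_000)
        print(len(keys), keys[0])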
Hash keys (or business keys) are part of the success of Data Vault 2.0 in a big data and NoSQL world. Hashing is optional in DV2. There are a variety of hashing algorithms available for use that include the