Page 231 - Data Architecture

Chapter 6.2: Introduction to Data Vault Modeling
           >child tables? Or relationships that are multiple levels deep? Then, the problem escalates
           as the length of the load cycles grows exponentially.


           To be fair, let's now address some of the benefits of using sequence numbers.
           Once established, sequence numbers have the following positive impacts:


               • Small byte size (generally less than number(38)) (38 “9’s”) or 10^125.
               • Process benefit: joins across tables can leverage small byte size comparisons.
               • Process benefit: joins can leverage numeric comparisons (faster than character or binary
               comparisons).
               • Always unique for each new record inserted.
               • Some engines can further partition (group) the numerical sequences in ascending order and apply
               subpartition (micropartition) pruning through range selection during the join process (in parallel).


           Hash Keys


           What is a hash key? A hash key is a business key (possibly composed of several fields) run through a
           computational function called a hash function, with the result assigned as the primary key of the table.
           Hash functions are deterministic: given input X, the function produces output Y every single time
           it is provided X (the same input always generates the same output). Definitions of hash functions, what
           they are and how they work, can be found on Wikipedia.
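
           As a minimal sketch of the definition above, a hash key for a composite business key might be
           computed as follows. The choice of MD5, the `||` delimiter, the trim/upper-case normalization, and
           the function name `hash_key` are all assumptions made for illustration; the text does not prescribe them.

```python
import hashlib

def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Return a hex hash key for a (possibly composite) business key."""
    # Normalize each part so cosmetic differences (case, padding)
    # do not produce a different key for the same business key.
    normalized = [part.strip().upper() for part in business_key_parts]
    # Join with a delimiter so ("AB", "C") and ("A", "BC") stay distinct.
    joined = delimiter.join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Deterministic: the same input always yields the same output.
first = hash_key("ACME", "1001")
second = hash_key(" acme ", "1001")
assert first == second
```

           Because the result depends only on the input, any process that knows the business key and the
           agreed normalization rules can reproduce the same key.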


           Hash key benefits to any data model:

               • 100% parallel independent load processes (if referential integrity is shut off) even if these load
               processes are split on multiple platforms or multiple locations.
               • Lazy joins, that is, the ability to join across multiple platforms utilizing technology like Apache Drill (or
               something similar), even without referential integrity. Note that lazy joins can’t be accomplished with
               sequences across heterogeneous platform environments, and sequences aren’t even supported in some NoSQL engines.
               • Single field primary key attribute (same benefit here as the sequence numbering solution).
               • Deterministic: it can even be precomputed on the source systems or at the edge for IoT devices/edge
               computing.
               • Can represent unstructured and multistructured data sets: based on specific input, hash keys can be
               calculated again and again (in parallel). In other words, a hash key can be constructed as a business key
               for audio, images, video, and documents. This is something sequences cannot do in a deterministic
               fashion.
               • If there is a desire to build a smart hash function, then meaning can be assigned to bits of the hash
               (similar to Teradata and what it computes for the underlying storage and data access).
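
           Two of the benefits listed above, independent parallel loads and keys for unstructured content, can be
           sketched as follows. This is an illustration under assumptions, not a reference implementation: the
           table shapes, the helper names (`hash_key`, `load_hub`, `load_satellite`, `content_hash_key`), the
           sample rows, and the use of MD5 are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_key(business_key: str) -> str:
    """Hypothetical helper: MD5 over a trimmed, upper-cased business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def load_hub(rows):
    # Hub: one entry per business key, keyed by its hash key.
    return {hash_key(r["customer_no"]): r["customer_no"] for r in rows}

def load_satellite(rows):
    # Satellite: descriptive attributes keyed by the same hash key,
    # computed here without looking anything up in the hub.
    return [(hash_key(r["customer_no"]), r["name"]) for r in rows]

def content_hash_key(content: bytes) -> str:
    # Key derived from raw bytes, e.g. a document, image, or audio file.
    return hashlib.md5(content).hexdigest()

source_rows = [
    {"customer_no": "C-1001", "name": "Acme"},
    {"customer_no": "C-1002", "name": "Globex"},
]

# Both loads run independently; neither waits for the other to assign keys.
with ThreadPoolExecutor(max_workers=2) as pool:
    hub_future = pool.submit(load_hub, source_rows)
    sat_future = pool.submit(load_satellite, source_rows)
    hub, satellite = hub_future.result(), sat_future.result()

# Every satellite row still joins to its hub row.
assert all(key in hub for key, _ in satellite)

# The same bytes always produce the same key.
document_bytes = b"%PDF-1.7 example document content"
assert content_hash_key(document_bytes) == content_hash_key(document_bytes)
```

           With sequence numbers, the satellite load would first have to look up the key that the hub load
           assigned, which forces the two loads to run in order.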

           Hash keys are important to Data Vault 2.0 because of the efforts to connect
           heterogeneous data environments such as Hadoop and Oracle. Hash keys are also
