
Chapter 6.2: Introduction to Data Vault Modeling
• Have an upper limit (the size of the numeric field for nondecimal values).
• Introduce a process issue when utilizing sequences during load, because they require any child entity to look up its corresponding parent record in order to inherit the parent's value (see the sketch after this list).
• Hold no business meaning.
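
A minimal sketch of the load-time lookup described in the second bullet. The table and column names (parent, child, parent_sk, parent_bk) are hypothetical, and SQLite merely stands in for whatever engine actually holds the data; the point is only that every child row must first resolve its parent's sequence-generated value by business key before it can be written.

```python
# Hypothetical schema: parent(parent_sk, parent_bk), child(child_bk, parent_sk).
# The parent_sk column is assumed to be populated by a sequence.
import sqlite3


def load_child_rows(conn: sqlite3.Connection, child_rows: list[dict]) -> None:
    """Insert child rows, inheriting each parent's sequence-generated key."""
    cur = conn.cursor()
    for row in child_rows:  # one record at a time
        # Row-by-row lookup: resolve the parent's sequence value from its business key.
        cur.execute(
            "SELECT parent_sk FROM parent WHERE parent_bk = ?",
            (row["parent_bk"],),
        )
        hit = cur.fetchone()
        if hit is None:
            raise ValueError(f"orphan child: no parent found for {row['parent_bk']!r}")
        cur.execute(
            "INSERT INTO child (child_bk, parent_sk) VALUES (?, ?)",
            (row["child_bk"], hit[0]),  # hit[0] is the inherited sequence value
        )
    conn.commit()
```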


The most critical of the issues above is the negative performance impact of the lookup or join processes, particularly in heterogeneous environments or in environments where data are legally not allowed to "live" on, or be replicated to, other environments (geographically split, or a mix of on-premises and in-cloud). The problem is exacerbated by high-speed IoT or real-time feeds. Consider what happens when such a feed quickly produces billions of child records and each record must wait on a sequence "lookup" (one record at a time); the real-time stream may back up.
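
The sketch below illustrates that back-up mechanism under an assumed, purely illustrative lookup latency; lookup_parent_sk() is a hypothetical stand-in for the round trip to wherever the parent sequence values live. If events arrive faster than the one-at-a-time lookup can answer, the input queue grows without bound.

```python
# Illustrative only: a streaming consumer whose per-event parent-key lookup is
# the serialization point.
import queue
import time


def lookup_parent_sk(parent_bk: str) -> int:
    """Hypothetical stand-in for a round trip to the parent store."""
    time.sleep(0.001)        # assume roughly 1 ms per lookup
    return hash(parent_bk)   # placeholder for the real sequence value


def consume(events: "queue.Queue[dict]") -> None:
    while True:
        event = events.get()                                    # child record arrives
        event["parent_sk"] = lookup_parent_sk(event["parent_bk"])
        # ... write the enriched child record downstream ...
        events.task_done()
        # At ~1 ms per lookup this consumer tops out near 1,000 records/s;
        # any sustained arrival rate above that simply backs up in the queue.
```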


Lookups also cause "precaching" problems under volume loads. For example, suppose the parent table is invoice and the child table is order. If the invoice table has 500 million records, the order table has 5 billion records, and each order has at least one matching parent row (most likely more), then each record that flows into order must "look up" at least one invoice. This lookup happens 5 billion times, once for each child record.
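
To put that in rough perspective: if each lookup averaged, say, 0.1 ms against a warm index (an assumption for illustration only), 5 billion one-at-a-time probes would cost 5,000,000,000 × 0.0001 s ≈ 500,000 s, or roughly 5.8 days of lookup time alone, before any insert work is counted.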


It doesn't matter whether the technology is an ETL engine, a real-time processing engine, or a SQL data management engine: this process must happen to avoid potential orphan records. If referential integrity is shut off, the load processes for both tables can run in parallel. However, to populate the "parent sequence," the parent record must still be searched/looked up on a row-by-row basis. Adding parallelism and partitioning will help performance, but eventually it hits an upper-limit bottleneck.
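
A sketch of that mitigation, again with hypothetical names: hash-partitioning the child rows and running the lookups in parallel divides the elapsed time, but each worker still resolves the parent sequence one row at a time, so the total number of probes, and therefore the load on the parent store, is unchanged.

```python
# Illustrative only: parallelism spreads the row-by-row lookups across workers
# but does not remove them.
from concurrent.futures import ThreadPoolExecutor


def resolve_partition(partition: list[dict], lookup) -> list[dict]:
    # Each worker still resolves the parent sequence one child row at a time.
    for row in partition:
        row["parent_sk"] = lookup(row["parent_bk"])
    return partition


def parallel_load(child_rows: list[dict], lookup, workers: int = 8) -> list[dict]:
    # Hash-partition on the parent business key so partitions load independently.
    partitions = [[] for _ in range(workers)]
    for row in child_rows:
        partitions[hash(row["parent_bk"]) % workers].append(row)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        resolved = list(pool.map(resolve_partition, partitions, [lookup] * workers))
    # Total probes performed: len(child_rows), exactly as in the serial case;
    # only wall-clock time is divided, until the parent store saturates.
    return [row for part in resolved for row in part]
```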


In an MPP environment (MPP storage), the data will be redistributed to allow the join to occur, and it is not just the sequence that has to be shipped; it is the sequence PLUS the entire business key it is tied to. In an MPP engine with non-MPP storage (such as Snowflake), the data don't have to be shipped, but the lookup process still must happen.


This single-strung, one-record-at-a-time lookup can tremendously (and negatively) impact load performance. In large-scale solutions (think of 1,000 "tables" or data sets, each with 1 billion records or more), the problem is compounded and load times increase dramatically.


What if there is one child table? What if the data model design has parent->child->child-