
Chapter 6.2: Introduction to Data Vault Modeling
• Have an upper limit (the size of the numeric field for nondecimal values).
• Introduce a process issue when utilizing sequences during load, because they require any child entity to look up its corresponding parent record in order to inherit the parent's value (see the sketch after this list).
• Hold no business meaning.
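
A minimal sketch of the load-time lookup described in the second bullet. The table and column names (parent, child, parent_sk, parent_bk) are hypothetical, and SQLite merely stands in for whatever engine actually holds the data; the point is only that every child row must first resolve its parent's sequence-generated value by business key before it can be written.

```python
# Hypothetical schema: parent(parent_sk, parent_bk), child(child_bk, parent_sk).
# The parent_sk column is assumed to be populated by a sequence.
import sqlite3


def load_child_rows(conn: sqlite3.Connection, child_rows: list[dict]) -> None:
    """Insert child rows, inheriting each parent's sequence-generated key."""
    cur = conn.cursor()
    for row in child_rows:  # one record at a time
        # Row-by-row lookup: resolve the parent's sequence value from its business key.
        cur.execute(
            "SELECT parent_sk FROM parent WHERE parent_bk = ?",
            (row["parent_bk"],),
        )
        hit = cur.fetchone()
        if hit is None:
            raise ValueError(f"orphan child: no parent found for {row['parent_bk']!r}")
        cur.execute(
            "INSERT INTO child (child_bk, parent_sk) VALUES (?, ?)",
            (row["child_bk"], hit[0]),  # hit[0] is the inherited sequence value
        )
    conn.commit()
```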


The most critical of the issues above is the negative performance impact of the lookup or join processes, particularly in heterogeneous environments or in environments where data are legally not allowed to "live" on, or be replicated to, other environments (geographically split, or a mix of on-premises and in-cloud). The problem is exacerbated by high-speed IoT or real-time feeds. Consider what happens when such a feed quickly produces billions of child records and each record must wait on a sequence "lookup" (one record at a time); the real-time stream may back up.
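
The sketch below illustrates that back-up mechanism under an assumed, purely illustrative lookup latency; lookup_parent_sk() is a hypothetical stand-in for the round trip to wherever the parent sequence values live. If events arrive faster than the one-at-a-time lookup can answer, the input queue grows without bound.

```python
# Illustrative only: a streaming consumer whose per-event parent-key lookup is
# the serialization point.
import queue
import time


def lookup_parent_sk(parent_bk: str) -> int:
    """Hypothetical stand-in for a round trip to the parent store."""
    time.sleep(0.001)        # assume roughly 1 ms per lookup
    return hash(parent_bk)   # placeholder for the real sequence value


def consume(events: "queue.Queue[dict]") -> None:
    while True:
        event = events.get()                                    # child record arrives
        event["parent_sk"] = lookup_parent_sk(event["parent_bk"])
        # ... write the enriched child record downstream ...
        events.task_done()
        # At ~1 ms per lookup this consumer tops out near 1,000 records/s;
        # any sustained arrival rate above that simply backs up in the queue.
```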


Lookups also cause "precaching" problems under volume loads. For example, suppose the parent table is invoice and the child table is order. If the invoice table has 500 million records, the order table has 5 billion records, and each order has at least one matching parent row (most likely more), then each record that flows into order must "look up" at least one invoice. This lookup happens 5 billion times, once for each child record.
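
To put that in rough perspective: if each lookup averaged, say, 0.1 ms against a warm index (an assumption for illustration only), 5 billion one-at-a-time probes would cost 5,000,000,000 × 0.0001 s ≈ 500,000 s, or roughly 5.8 days of lookup time alone, before any insert work is counted.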


It doesn't matter whether the technology is an ETL engine, a real-time processing engine, or a SQL data management engine: this process must happen to avoid potential orphan records. If referential integrity is shut off, the load processes for both tables can run in parallel. However, to populate the "parent sequence," the parent record must still be searched/looked up on a row-by-row basis. Adding parallelism and partitioning will help performance, but eventually it hits an upper-limit bottleneck.
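
A sketch of that mitigation, again with hypothetical names: hash-partitioning the child rows and running the lookups in parallel divides the elapsed time, but each worker still resolves the parent sequence one row at a time, so the total number of probes, and therefore the load on the parent store, is unchanged.

```python
# Illustrative only: parallelism spreads the row-by-row lookups across workers
# but does not remove them.
from concurrent.futures import ThreadPoolExecutor


def resolve_partition(partition: list[dict], lookup) -> list[dict]:
    # Each worker still resolves the parent sequence one child row at a time.
    for row in partition:
        row["parent_sk"] = lookup(row["parent_bk"])
    return partition


def parallel_load(child_rows: list[dict], lookup, workers: int = 8) -> list[dict]:
    # Hash-partition on the parent business key so partitions load independently.
    partitions = [[] for _ in range(workers)]
    for row in child_rows:
        partitions[hash(row["parent_bk"]) % workers].append(row)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        resolved = list(pool.map(resolve_partition, partitions, [lookup] * workers))
    # Total probes performed: len(child_rows), exactly as in the serial case;
    # only wall-clock time is divided, until the parent store saturates.
    return [row for part in resolved for row in part]
```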


In an MPP environment (MPP storage), the data will be redistributed to allow the join to occur, and it is not just the sequence that has to be shipped; it is the sequence PLUS the entire business key it is tied to. In an MPP engine with non-MPP storage (such as Snowflake), the data don't have to be shipped, but the lookup process still must happen.


This single-strung, one-record-at-a-time lookup can tremendously (and negatively) impact load performance. In large-scale solutions (think of 1,000 "tables" or data sets, each with 1 billion records or more), the problem is compounded and load times increase dramatically.


What if there is one child table? What if the data model design has parent->child->child-