
Chapter 6.2: Introduction to Data Vault Modeling
The issue with a multipart business key is join performance. Multiple mathematical tests and quantitative results have shown, time and again, that multifield join criteria are slower than single-field join criteria. The slowdown only becomes noticeable in large-volume or big data solutions. At that scale, a hash key or surrogate sequence in the data vault may be faster than a multifield join because it reduces the join back to a single-field value.
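As a minimal sketch of that reduction (Python is used here purely for illustration; the sample values and the delimiter are assumptions, not part of any specification), a multipart business key can be collapsed into one hash key so that downstream joins compare a single field:

import hashlib

def hub_hash_key(*business_key_parts, delimiter="||"):
    # Normalize each part (trim, upper-case), join with a delimiter,
    # then hash, so any number of business key fields becomes one value.
    normalized = delimiter.join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A three-part business key collapses to a single joinable value.
print(hub_hash_key("4711", "us-east", "2021-03-01"))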


Another alternative is to concatenate the multifield values, either with or without delimiters, thus forming something akin to an intelligent key. The choice depends on how the business defines a standard for concatenating the multifield values (i.e., the rules needed, just as rules are needed to define a smart key).
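A hedged illustration of such a standard follows; the trimming, casing, and delimiter rules are assumptions the business would have to agree on, not fixed rules:

def concatenated_business_key(parts, delimiter="^"):
    # One agreed standard: trim whitespace, upper-case, and separate parts
    # with a delimiter that never appears in the data itself.
    return delimiter.join(str(p).strip().upper() for p in parts)

print(concatenated_business_key(["4711", "us-east ", "2021-03-01"]))
# -> 4711^US-EAST^2021-03-01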


The last thing to watch when choosing a multipart business key is the length of the combined or concatenated field. If the concatenated fields are longer than a hash result or surrogate sequence ID, the join will execute more slowly than a join on the shorter field. As a reminder, these differences in performance usually appear only in large data sets (500 million to 1 billion records or more). Hardware has advanced, and will continue to advance, to the point where small data sets perform well regardless; there is simply not enough of a difference in a small data set to make an informed decision about the choice of the "primary key" for the hubs.
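A small illustrative comparison (the field values are invented) shows why the fixed length of a hash matters: the concatenated key grows with every part added, while an MD5 hash stays at 32 hexadecimal characters (16 bytes in binary form):

import hashlib

parts = ["GLOBAL-CUSTOMER-0000004711", "NORTH-AMERICA-EAST", "2021-03-01T00:00:00"]
concatenated = "||".join(parts)
hashed = hashlib.md5(concatenated.encode("utf-8")).hexdigest()

print(len(concatenated))  # 67 characters here, growing with each added field
print(len(hashed))        # always 32 hex characters, regardless of input length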


Ultimately, the suggestion is to rekey the source data solutions and add a smart or intelligent key "up front" that can carry the data across instances, across business processes, across upgrades, through master data, and across hybrid environments, and that never changes. Doing this would centralize and ease the pain and cost of "master data" and would make a virtualization engine easier to use. With such a key in place, complex analytics, neural nets, or machine learning algorithms may no longer be needed to tie the data sets back together later.


In fact, according to one estimate, "fixing" these keying issues in the warehouse costs the business seven times as much as addressing them in the source applications. Fixing the problem in the data warehouse is one form of technical debt (quote and metrics paraphrased from Nols Ebersohn).


If the source system cannot be rekeyed, or cannot add an "intelligent" or "smart" key that serves as a contextual key, the recommendation is to implement master data management upstream. If MDM cannot be implemented, the next recommendation is to leverage the source system business keys; where the business keys are composite, a hash key is the base-level default recommendation.
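The precedence described above can be summarized in a small decision sketch; it is purely illustrative, and the flags are assumptions about what is known of the source environment:

def hub_key_strategy(can_rekey_source, can_implement_mdm, business_key_is_composite):
    # Order of preference follows the recommendations in this section.
    if can_rekey_source:
        return "smart/intelligent key added in the source system"
    if can_implement_mdm:
        return "master data management upstream"
    if not business_key_is_composite:
        return "source system business key"
    return "hash key (base-level default for composite business keys)"

print(hub_key_strategy(False, False, True))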


