
Chapter 6.2: Introduction to Data Vault Modeling
The issue with a multipart business key is join performance. Multiple mathematical tests and quantitative results have shown, time and again, that multifield join criteria are slower than single-field join criteria. The slowdown only becomes noticeable in large-volume or big data solutions. At that scale, a hash key or surrogate sequence in the data vault may be faster than a multifield join because it reduces the join back to a single-field value.
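As a minimal sketch of that reduction (Python is used here purely for illustration; the sample values and the delimiter are assumptions, not part of any specification), a multipart business key can be collapsed into one hash key so that downstream joins compare a single field:

import hashlib

def hub_hash_key(*business_key_parts, delimiter="||"):
    # Normalize each part (trim, upper-case), join with a delimiter,
    # then hash, so any number of business key fields becomes one value.
    normalized = delimiter.join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A three-part business key collapses to a single joinable value.
print(hub_hash_key("4711", "us-east", "2021-03-01"))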


Another alternative is to concatenate the multifield values, either with or without delimiters, thus forming something akin to an intelligent key. The choice depends on how the business defines a standard for concatenating the multifield values (i.e., the rules needed, just as rules are needed to define a smart key).
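A hedged illustration of such a standard follows; the trimming, casing, and delimiter rules are assumptions the business would have to agree on, not fixed rules:

def concatenated_business_key(parts, delimiter="^"):
    # One agreed standard: trim whitespace, upper-case, and separate parts
    # with a delimiter that never appears in the data itself.
    return delimiter.join(str(p).strip().upper() for p in parts)

print(concatenated_business_key(["4711", "us-east ", "2021-03-01"]))
# -> 4711^US-EAST^2021-03-01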


The last thing to watch when choosing a multipart business key is the length of the combined or concatenated field. If the concatenated fields are longer than a hash result or surrogate sequence ID, the join will execute more slowly than a join on the shorter field. As a reminder, these differences in performance usually appear only in large data sets (500 million to 1 billion records or more). Hardware has advanced, and will continue to advance, to the point where small data sets perform well regardless; there is simply not enough of a difference in a small data set to make an informed decision about the choice of the "primary key" for the hubs.
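A small illustrative comparison (the field values are invented) shows why the fixed length of a hash matters: the concatenated key grows with every part added, while an MD5 hash stays at 32 hexadecimal characters (16 bytes in binary form):

import hashlib

parts = ["GLOBAL-CUSTOMER-0000004711", "NORTH-AMERICA-EAST", "2021-03-01T00:00:00"]
concatenated = "||".join(parts)
hashed = hashlib.md5(concatenated.encode("utf-8")).hexdigest()

print(len(concatenated))  # 67 characters here, growing with each added field
print(len(hashed))        # always 32 hex characters, regardless of input length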


Ultimately, the suggestion is to rekey the source data solutions and add a smart or intelligent key "up front" that can carry the data across instances, across business processes, across upgrades, through master data, and across hybrid environments, and that never changes. Doing this would centralize and ease the pain and cost of "master data" and would make a virtualization engine easier to use. With such a key in place, complex analytics, neural nets, or machine learning algorithms may no longer be needed to tie the data sets back together later.


In fact, according to one estimate, "fixing" these keying issues in the warehouse costs the business seven times as much as addressing them in the source applications. Fixing the problem in the data warehouse is one form of technical debt (quote and metrics paraphrased from Nols Ebersohn).


If the source system cannot be rekeyed, or cannot add an "intelligent" or "smart" key that serves as a contextual key, the recommendation is to implement master data management upstream. If MDM cannot be implemented, the next recommendation is to leverage the source system business keys; where the business keys are composite, a hash key is the base-level default recommendation.
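The precedence described above can be summarized in a small decision sketch; it is purely illustrative, and the flags are assumptions about what is known of the source environment:

def hub_key_strategy(can_rekey_source, can_implement_mdm, business_key_is_composite):
    # Order of preference follows the recommendations in this section.
    if can_rekey_source:
        return "smart/intelligent key added in the source system"
    if can_implement_mdm:
        return "master data management upstream"
    if not business_key_is_composite:
        return "source system business key"
    return "hash key (base-level default for composite business keys)"

print(hub_key_strategy(False, False, True))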


