Page 236 - Data Architecture

Chapter 6.2: Introduction to Data Vault Modeling
           because no master data management solution has been implemented upstream of the data
           warehouse. Therefore, to put together what appears to be “one version of the customer
           record” and not double or triple count, algorithms are applied to bridge the keys together.


           In the data vault landscape, we call this a hierarchical or same-as link: hierarchical if it
           represents a multilevel hierarchy, and same-as if it represents a single-level hierarchy
           (a parent-to-child remap) of terms.
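As a minimal sketch of how a same-as link bridges keys, the example below maps duplicate customer keys back to a master key before counting customers. The key values, the `same_as_link` pairs, and the `resolve` helper are all hypothetical; in practice a matching algorithm would produce the pairs.

```python
# Hypothetical same-as link rows: (master_key, duplicate_key).
# A matching algorithm would normally produce these pairs.
same_as_link = [
    ("CUST-001", "CUST-417"),
    ("CUST-001", "CUST-902"),
]

# Single-level remap: every duplicate key points directly at its master key.
to_master = {dup: master for master, dup in same_as_link}


def resolve(key: str) -> str:
    """Return the master key for a customer key, or the key itself if unmapped."""
    return to_master.get(key, key)


# Illustrative order rows: (customer_key, amount).
orders = [("CUST-001", 100), ("CUST-417", 50), ("CUST-902", 25), ("CUST-300", 10)]

# Distinct customers before and after bridging the keys.
raw_customers = {key for key, _ in orders}
bridged_customers = {resolve(key) for key, _ in orders}
```

Counting `raw_customers` would triple count the first customer, while `bridged_customers` treats the three bridged keys as one record.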


           Placing these sequence numbers as business keys in hubs has the following issues:


               • They are meaningless—a human cannot determine what the key stands for (contextually) without
               examining the details for a moment in time.
               • They can change—often they do, even with something as “simple” as a source system upgrade—this
               results in a serious loss of traceability to the historical artifacts. Without an “old-key” to “new-key”
               map, there is no definitive traceability.
               • They can collide. Even though conceptually across the business there is one element called “customer
               account,” the same ID sequence may be assigned in different instances for different customer accounts.
               In this case, they should never be combined. An example of this would be two different implementations
               of SAP: one in Japan and one in Canada. Each assigns customer ID #1; however, in Japan's system, #1
               represents “Joe Johnson,” whereas in Canada's system, #1 represents “Margarite Smith.” The last thing
               you want in analytics is to “combine” these two records for reporting just because they have the same
               surrogate ID.
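The collision described in the last bullet can be sketched in a few lines. The row layout and variable names below are assumptions made for illustration; only the two customer names and the shared ID #1 come from the example above.

```python
# Two source systems that each assign customer ID 1 independently.
japan_sap = [{"customer_id": 1, "name": "Joe Johnson"}]
canada_sap = [{"customer_id": 1, "name": "Margarite Smith"}]

# Naive hub load keyed on the source sequence number alone.
hub_keys = set()
for row in japan_sap + canada_sap:
    hub_keys.add(row["customer_id"])

source_customers = len(japan_sap) + len(canada_sap)
hub_customers = len(hub_keys)
```

Two distinct customers enter the load, but the hub holds only one key, so any downstream report would combine their records.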


           An additional question arises if the choice is made to utilize data vault sequence numbers
           for hubs and the source system business keys are surrogates. The question is as follows:
           why “rekey” or “renumber” the original business key? Why not just use the original
           business key (which by the way is how the original hub is defined)?


           To stop the collision (as put forward in the example above)—whether a surrogate
           sequence, a hash key, or the source business key is chosen for the hub structure—another
           element must be added. This secondary element ensures uniqueness of this surrogate
           business key. One of the best practices here is to assign geography codes, for example,
           JAP for any customer account IDs that originate from Japan's SAP instance and CAN for
           any customer account IDs that originate from Canada's SAP instance.
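If a hash key is chosen for the hub, one way to apply the geography code is to hash it together with the source ID. This is a sketch only: the `hub_hash_key` function, the `|` delimiter, and the use of MD5 are assumptions for illustration, not a method prescribed by the text.

```python
import hashlib


def hub_hash_key(geo_code: str, customer_id: int) -> str:
    """Hash the geography code and source ID together.

    The "|" delimiter and MD5 are illustrative assumptions.
    """
    business_key = f"{geo_code}|{customer_id}"
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()


# Customer ID 1 from each SAP instance now yields a different hub key.
japan_key = hub_hash_key("JAP", 1)
canada_key = hub_hash_key("CAN", 1)
```

Because the geography code is part of the hashed input, the two instances no longer collide, and the same input always reproduces the same key.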



           Multipart Source Business Keys


           Using a geographic code, as mentioned above, brings up another issue. If the hub is
           created based solely on source system business key (and not surrogate sequence or hash
           key), then with the choice above (to add a geography code split), the model must be
           designed and built with a multipart business key.
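A hub with a multipart business key might look like the sketch below, which uses an in-memory SQLite database. The table name, column names, load dates, and record source values are hypothetical; the point is only that the geography code and the source ID together form the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The business key has two parts: geography code plus source customer ID.
conn.execute(
    """
    CREATE TABLE hub_customer_account (
        geo_code      TEXT    NOT NULL,
        customer_id   INTEGER NOT NULL,
        load_date     TEXT    NOT NULL,
        record_source TEXT    NOT NULL,
        PRIMARY KEY (geo_code, customer_id)
    )
    """
)

rows = [
    ("JAP", 1, "2024-01-01", "SAP_JAPAN"),
    ("CAN", 1, "2024-01-01", "SAP_CANADA"),
    ("JAP", 1, "2024-01-02", "SAP_JAPAN"),  # repeat arrival of an existing key
]

# INSERT OR IGNORE keeps the first arrival of each key and skips repeats.
conn.executemany(
    "INSERT OR IGNORE INTO hub_customer_account VALUES (?, ?, ?, ?)", rows
)

hub_rows = conn.execute(
    "SELECT geo_code, customer_id FROM hub_customer_account ORDER BY geo_code"
).fetchall()
```

Both customers with ID 1 are kept as separate hub rows, and the repeated Japanese key is not loaded a second time. The cost is that every link and satellite referencing this hub must carry both key parts.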