
          134   Chapter 4 Data Warehousing and Online Analytical Processing



                         warehouse based on the same corporate data model set noted before. Third, distributed
                         data marts can be constructed to integrate different data marts via hub servers. Finally,
                         a multitier data warehouse is constructed where the enterprise warehouse is the sole
                         custodian of all warehouse data, which is then distributed to the various dependent
                         data marts.


                   4.1.6 Extraction, Transformation, and Loading

                         Data warehouse systems use back-end tools and utilities to populate and refresh their
                         data (Figure 4.1). These tools and utilities include the following functions:

                           Data extraction, which typically gathers data from multiple, heterogeneous, and
                           external sources.
                           Data cleaning, which detects errors in the data and rectifies them when possible.
                           Data transformation, which converts data from legacy or host format to warehouse
                           format.
                           Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
                           builds indices and partitions.
                           Refresh, which propagates the updates from the data sources to the warehouse.
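
                         The five functions above can be sketched as a small pipeline. This is only an
                         illustrative sketch, not how any particular warehouse product works: the
                         Warehouse class, the record fields (id, region, amount), and the source names
                         are all hypothetical, and refresh is simplified to rerunning the whole pipeline
                         rather than propagating only the changed records.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """Hypothetical in-memory stand-in for a warehouse table plus one summary view."""
    rows: dict = field(default_factory=dict)              # key -> loaded record
    totals_by_region: dict = field(default_factory=dict)  # precomputed summary


def extract(sources):
    """Extraction: gather records from multiple, heterogeneous sources."""
    for name, records in sources.items():
        for rec in records:
            yield {**rec, "_source": name}


def clean(records):
    """Cleaning: detect errors and rectify them when possible."""
    for rec in records:
        if rec.get("amount") is None:
            continue  # cannot be repaired in this sketch, so the record is dropped
        rec = dict(rec)
        rec["region"] = (rec.get("region") or "unknown").strip().lower()
        yield rec


def transform(records):
    """Transformation: convert source formats to a single warehouse format."""
    for rec in records:
        yield {
            "id": f'{rec["_source"]}:{rec["id"]}',
            "region": rec["region"],
            "amount": float(rec["amount"]),  # some sources store amounts as strings
        }


def load(warehouse, records):
    """Load: sort, check integrity, store, and recompute the summary view."""
    for rec in sorted(records, key=lambda r: r["id"]):
        if rec["amount"] < 0:
            raise ValueError(f"integrity check failed for {rec['id']}")
        warehouse.rows[rec["id"]] = rec
    warehouse.totals_by_region = {}
    for rec in warehouse.rows.values():
        region = rec["region"]
        warehouse.totals_by_region[region] = (
            warehouse.totals_by_region.get(region, 0.0) + rec["amount"]
        )


def refresh(warehouse, sources):
    """Refresh: propagate source updates (here, by rerunning the full pipeline)."""
    load(warehouse, transform(clean(extract(sources))))


sources = {
    "legacy": [{"id": 1, "region": " East ", "amount": "10.5"}],
    "web": [
        {"id": 1, "region": None, "amount": 4},
        {"id": 2, "region": "east", "amount": None},
    ],
}
warehouse = Warehouse()
refresh(warehouse, sources)
```

                         Calling refresh again after a source changes overwrites the affected rows and
                         rebuilds the summary. A production tool would more likely load only the changed
                         records incrementally.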

                         In addition to cleaning, loading, refreshing, and metadata definition tools, data
                         warehouse systems usually provide a set of data warehouse management tools.
                           Data cleaning and data transformation are important steps in improving the data
                         quality and, subsequently, the data mining results (see Chapter 3). Because we are mostly
                         interested in the aspects of data warehousing technology related to data mining, we will
                         not go into the details of the remaining tools, and recommend that interested readers
                         consult books dedicated to data warehousing technology.



                   4.1.7 Metadata Repository
                         Metadata are data about data. When used in a data warehouse, metadata are the data
                         that define warehouse objects. Figure 4.1 showed a metadata repository within the bottom
                         tier of the data warehousing architecture. Metadata are created for the data names
                         and definitions of the given warehouse. Additional metadata are created and captured
                         to record the timestamp of any extracted data, the source of the extracted data, and
                         any missing fields that have been added by data cleaning or integration processes.
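
                         As a minimal sketch of the additional metadata just described, the following
                         records a timestamp, a source, and the fields filled in during cleaning for one
                         extracted row. The ExtractionMetadata class, the clean_with_metadata helper, and
                         the sample field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExtractionMetadata:
    """Hypothetical metadata record kept alongside one extracted row."""
    source: str                                        # where the row came from
    extracted_at: datetime                             # when it was extracted
    filled_fields: list = field(default_factory=list)  # fields added by cleaning


def clean_with_metadata(row, source, defaults):
    """Fill missing fields from `defaults` and record which ones were added."""
    meta = ExtractionMetadata(source=source, extracted_at=datetime.now(timezone.utc))
    cleaned = dict(row)
    for name, default in defaults.items():
        if cleaned.get(name) is None:
            cleaned[name] = default
            meta.filled_fields.append(name)
    return cleaned, meta


row, meta = clean_with_metadata(
    {"id": 7, "city": None},
    source="branch_db",
    defaults={"city": "unknown", "country": "n/a"},
)
```

                         Keeping the list of filled fields lets later analysis distinguish values that
                         were observed at the source from values supplied by the cleaning step.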
                           A metadata repository should contain the following:

                           A description of the data warehouse structure, which includes the warehouse schema,
                           view, dimensions, hierarchies, and derived data definitions, as well as data mart
                           locations and contents.