
Chapter 9  Business Intelligence Systems
                Figure 9-14  Possible Problems with Source Data
                •  Dirty data
                •  Missing values
                •  Inconsistent data
                •  Data not integrated
                •  Wrong granularity
                      – Too fine
                      – Not fine enough
                •  Too much data
                      – Too many attributes
                      – Too many data points



                                            they sell. An organization buys such data because for some uses, some data is better than no data
                                            at all. This is especially true for data items whose values are difficult to obtain, such as Number
                                            of Adults in Household, Household Income, Dwelling Type, and Education of Primary Income
                                            Earner. However, care is required here because for some BI applications a few missing or
                                            erroneous data points can seriously bias the analysis.
                                               Inconsistent data, the third problem in Figure 9-14, is particularly common for data that has
                                            been gathered over time. When an area code changes, for example, the phone number for a given
                                            customer before the change will not match the customer’s number afterward. Likewise, part codes
                                            can change, as can sales territories. Before such data can be used, it must be recoded for
                                            consistency over the period of the study.
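
The recoding step described above might look something like the following minimal sketch. The crosswalk table, the part codes, and the `recode_part` helper are all hypothetical, invented here for illustration; a real study would take its mapping from the organization's own change records.

```python
# Hypothetical crosswalk from retired part codes to their current codes.
# In practice this mapping would come from the organization's change history.
PART_CODE_CROSSWALK = {"A-100": "P-1000", "A-200": "P-2000"}


def recode_part(code: str, crosswalk: dict = PART_CODE_CROSSWALK) -> str:
    """Return the current code for a part; codes not in the crosswalk pass through."""
    return crosswalk.get(code, code)


# Sales rows gathered over several years, some recorded under old codes.
sales = [
    {"year": 2019, "part": "A-100", "units": 5},
    {"year": 2021, "part": "P-1000", "units": 7},
    {"year": 2021, "part": "P-3000", "units": 2},
]

# Recode every row so the same part carries the same code in every year.
recoded = [{**row, "part": recode_part(row["part"])} for row in sales]

for row in recoded:
    print(row["year"], row["part"], row["units"])
```

After recoding, the 2019 and 2021 rows for the same part share one code and can be compared or summed across the period of the study.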
                                               Some data inconsistencies arise from the nature of the business activity. Consider a Web-based
                                            order-entry system used by customers worldwide. When the Web server records the time of order,
                                            which time zone does it use? The server’s system clock time is irrelevant to an analysis of
                                            customer behavior. Coordinated Universal Time (formerly called Greenwich Mean Time) is also
                                            meaningless for that purpose. Somehow, Web server time must be adjusted to the time zone of
                                            the customer.
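
As a rough illustration of that adjustment, the sketch below converts a server timestamp recorded in UTC to a customer's local time. It assumes each customer record carries a fixed UTC offset in hours, which is a simplification made for this example; the function name `to_customer_time` is hypothetical. A production system would more likely store a named time zone per customer and use Python's `zoneinfo` module so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone


def to_customer_time(server_utc: datetime, utc_offset_hours: float) -> datetime:
    """Convert a timezone-aware UTC timestamp to the customer's local time.

    Assumes the customer's zone is given as a fixed offset from UTC
    (an illustrative simplification that ignores daylight saving time).
    """
    customer_tz = timezone(timedelta(hours=utc_offset_hours))
    return server_utc.astimezone(customer_tz)


# An order the Web server logged at 02:30 UTC on March 1.
order_utc = datetime(2024, 3, 1, 2, 30, tzinfo=timezone.utc)

# For a customer eight hours behind UTC, the order was an evening purchase
# on the previous calendar day.
local = to_customer_time(order_utc, -8)
print(local.isoformat())
```

The instant in time is unchanged; only its local representation differs, which is what matters when the analysis asks at what time of day customers place orders.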
                                               Another problem is nonintegrated data. A particular BI analysis might require data from
                                            an ERP system, an e-commerce system, and a social networking application. Analysts may wish
                                            to integrate that organizational data with purchased consumer data. Such a data collection will
                                            likely have relationships that are not represented in primary key/foreign key relationships. It is the
                                            function of personnel in the data warehouse to integrate such data somehow.
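
One common way to integrate such data, sketched below, is to build a matching key from attributes the two sources share. The source does not say which technique the data warehouse staff would use, so this is only an assumed approach: the field names, the sample records, and the `match_key` helper are hypothetical, and matching on name plus postal code will produce both false matches and misses on real data.

```python
def match_key(name: str, postal_code: str) -> tuple:
    """Build a crude matching key: lowercased, whitespace-normalized name plus postal code."""
    return (" ".join(name.lower().split()), postal_code.strip())


# Records from an internal system (hypothetical fields); customer_id is its primary key.
erp_customers = [
    {"customer_id": 17, "name": "Ana  Lopez", "zip": "98101"},
    {"customer_id": 18, "name": "Ben Wu", "zip": "60601"},
]

# Purchased consumer data has no customer_id, so there is no foreign key to join on.
purchased = [
    {"name": "ana lopez", "zip": "98101 ", "household_income": "75-100K"},
]

# Index the purchased data by matching key, then look up each ERP customer.
index = {match_key(p["name"], p["zip"]): p for p in purchased}

integrated = []
for customer in erp_customers:
    match = index.get(match_key(customer["name"], customer["zip"]), {})
    integrated.append({**customer, "household_income": match.get("household_income")})

for row in integrated:
    print(row)
```

Customers with no match keep a missing value for the purchased attribute, which ties back to the missing-values problem in Figure 9-14.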
                                               Data can also have the wrong granularity, a term that refers to the level of detail
                                            represented by the data. Granularity can be too fine or too coarse. For the former, suppose we want
                                            to analyze the placement of graphics and controls on an order-entry Web page. It is possible to
                                            capture the customers’ clicking behavior in what is termed clickstream data. That data, however,
                                            includes everything the customer does at the Web site. In the middle of the order stream is data
                                            for clicks on the news, email, instant chat, and a weather check. Although all of that data may be
                                            useful for a study of consumer browsing behavior, it will be overwhelming if all we want to know
                                            is how customers respond to an ad located differently on the screen. To proceed, the data analysts
                                            must throw away millions and millions of clicks.
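
Throwing away the irrelevant clicks amounts to a filter over the clickstream. The sketch below assumes each click is a record with a page name and an element name; those fields, the sample events, and the `AD_ELEMENTS` set are hypothetical and stand in for whatever the real clickstream log contains.

```python
# Hypothetical clickstream: every click a customer makes, relevant or not.
clicks = [
    {"customer": 1, "page": "order_entry", "element": "ad_top"},
    {"customer": 1, "page": "news", "element": "headline"},
    {"customer": 2, "page": "order_entry", "element": "submit"},
    {"customer": 2, "page": "weather", "element": "forecast"},
    {"customer": 3, "page": "order_entry", "element": "ad_side"},
]

# The ad placements under study (assumed names).
AD_ELEMENTS = {"ad_top", "ad_side"}


def ad_clicks(events: list) -> list:
    """Keep only clicks on the studied ads on the order-entry page."""
    return [
        e for e in events
        if e["page"] == "order_entry" and e["element"] in AD_ELEMENTS
    ]


relevant = ad_clicks(clicks)
print(len(relevant), "of", len(clicks), "clicks kept")
```

On a real site the discarded share would be far larger than in this toy sample, which is why too-fine granularity is costly to work with.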
                                               Data can also be too coarse. For example, a file of regional sales totals cannot be used to
                                            investigate the sales in a particular store in a region, and total sales for a store cannot be used to
                                            determine the sales of particular items within a store. Instead, we need to obtain data that is fine
                                            enough for the lowest-level report we want to produce.
                                               In general, it is better to have too fine a granularity than too coarse. If the granularity is too
                                            fine, the data can be made coarser by summing and combining. This is what team members did
                                            with the sales data in Figure 9-6. Sales by Bill Year were too fine for their needs, so they summed
                                            sales data over those years. If the granularity is too coarse, however, there is no way to separate
                                            the data into constituent parts.
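
The asymmetry can be seen in a short sketch. Item-level sales can be summed up to store totals and then to regional totals, but nothing in a regional total says how to split it back into stores or items. The regions, stores, items, and amounts below are invented for illustration, and `roll_up` is a hypothetical helper.

```python
from collections import defaultdict

# Fine-grained data: one row per (region, store, item) with a sales amount.
item_sales = [
    ("West", "Store 1", "Widget", 100.0),
    ("West", "Store 1", "Gadget", 50.0),
    ("West", "Store 2", "Widget", 25.0),
    ("East", "Store 3", "Widget", 40.0),
]


def roll_up(rows: list, key_len: int) -> dict:
    """Sum amounts over the first key_len grouping columns of each row."""
    totals = defaultdict(float)
    for *keys, amount in rows:
        totals[tuple(keys[:key_len])] += amount
    return dict(totals)


by_store = roll_up(item_sales, 2)   # coarser: one total per (region, store)
by_region = roll_up(item_sales, 1)  # coarsest: one total per region

print(by_store)
print(by_region)
```

Going the other way is not possible: given only `by_region`, there is no function that recovers `by_store` or `item_sales`.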
                                               The final problem listed in Figure 9-14 is to have too much data. As shown in the figure,
                                            we can have either too many attributes or too many data points. Think back to the discussion of
                                            tables in Chapter 5. We can have too many columns or too many rows.