
Chapter 6 Foundations of Business Intelligence: Databases and Information Management 255


               Data Warehouses and Data Marts
               The traditional tool for analyzing corporate data for the past two decades has
               been the data warehouse. A data warehouse is a database that stores current
               and historical data of potential interest to decision makers throughout the
               company. The data originate in many core operational transaction systems,
               such as systems for sales, customer accounts, and manufacturing, and may
               include data from Web site transactions. The data warehouse extracts current
               and historical data from multiple operational systems inside the organization.
               These data are combined with data from external sources and transformed
               by correcting inaccurate and incomplete data and restructuring the data
               for management reporting and analysis before being loaded into the data
               warehouse.
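               The extract, transform, and load sequence described above can be sketched in a
               few lines of Python. This is a minimal illustration, not any vendor's tool: the
               source rows, the table name sales_fact, the cleaning policy (dropping rows with
               a missing amount and normalizing customer IDs), and the use of an in-memory
               SQLite database as the "warehouse" are all assumptions made for the example.

```python
import sqlite3

# Hypothetical extracts from two operational systems (illustrative data only).
sales_rows = [
    {"customer_id": "C1", "amount": "120.50", "date": "2013-01-05"},
    {"customer_id": "C2", "amount": "", "date": "2013-01-06"},       # incomplete row
    {"customer_id": "c1", "amount": "80.00", "date": "2013-01-07"},  # inconsistent key
]
account_rows = [
    {"customer_id": "C1", "region": "East"},
    {"customer_id": "C2", "region": "West"},
]


def transform(sales, accounts):
    """Clean the extracted rows and restructure them for reporting."""
    region_by_customer = {a["customer_id"].upper(): a["region"] for a in accounts}
    cleaned = []
    for row in sales:
        if not row["amount"]:
            continue  # assumed policy: drop rows with a missing amount
        customer = row["customer_id"].upper()  # normalize inconsistent keys
        region = region_by_customer.get(customer, "Unknown")
        cleaned.append((customer, region, float(row["amount"]), row["date"]))
    return cleaned


def load(conn, rows):
    """Load the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(customer_id TEXT, region TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(warehouse, transform(sales_rows, account_rows))

# A simple management-reporting query against the loaded data.
total_by_region = dict(
    warehouse.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
)
```

               Combining the sales rows with the account rows before loading is what lets the
               final query report by region, a field that the sales system alone does not hold.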
                  The data warehouse makes the data available for anyone to access as needed,
               but the data cannot be altered. A data warehouse system also provides a range of
               ad hoc and standardized query tools, analytical tools, and graphical reporting
               facilities.
                  Companies often build enterprise-wide data warehouses, where a central
               data warehouse serves the entire organization, or they create smaller,
               decentralized warehouses called data marts. A data mart is a subset of a data
               warehouse in which a summarized or highly focused portion of the organization’s
               data is placed in a separate database for a specific population of users. For
               example, a company might develop marketing and sales data marts to deal with
               customer information. Bookseller Barnes & Noble used to maintain a series of
               data marts: one for point-of-sale data in retail stores, another for college
               bookstore sales, and a third for online sales.
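               To make the idea of a data mart as a summarized, focused subset concrete, the
               sketch below copies one sales channel out of a small warehouse table into a
               separate database. The table names, the sample rows, and the helper build_mart
               are hypothetical, and SQLite stands in for both the warehouse and the mart.

```python
import sqlite3

# Stand-in warehouse holding sales from every channel (illustrative rows only).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (channel TEXT, title TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("retail", "Book A", 20.0),
        ("online", "Book A", 18.0),
        ("online", "Book B", 12.0),
        ("college", "Book C", 55.0),
        ("online", "Book A", 18.0),
    ],
)


def build_mart(source, channel):
    """Create a separate database holding a summary of one channel's sales."""
    mart = sqlite3.connect(":memory:")
    mart.execute("CREATE TABLE title_sales (title TEXT, units INTEGER, revenue REAL)")
    summary = source.execute(
        "SELECT title, COUNT(*), SUM(amount) FROM sales "
        "WHERE channel = ? GROUP BY title ORDER BY title",
        (channel,),
    ).fetchall()
    mart.executemany("INSERT INTO title_sales VALUES (?, ?, ?)", summary)
    mart.commit()
    return mart


online_mart = build_mart(warehouse, "online")
online_rows = online_mart.execute(
    "SELECT title, units, revenue FROM title_sales ORDER BY title"
).fetchall()
```

               Users of the online mart see only summarized online sales; the retail and
               college bookstore rows remain in the warehouse and could feed marts of their own.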


               Hadoop
               Relational DBMS and data warehouse products are not well suited for organizing
               and analyzing big data, or data that do not easily fit into the columns and rows
               used in their data models. For handling unstructured and semi-structured data
               in vast quantities, as well as structured data, organizations are using Hadoop.
               Hadoop is an open source software framework managed by the Apache
               Software Foundation that enables distributed parallel processing of huge
               amounts of data across inexpensive computers. It breaks a big data problem
               down into sub-problems, distributes them among up to thousands of inexpensive
               computer processing nodes, and then combines the results into a smaller
               data set that is easier to analyze. You have probably relied on Hadoop, or on
               similar distributed processing technology, to find the best airfare on the
               Internet, get directions to a restaurant, do a search on Google, or connect
               with a friend on Facebook.
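               The split, distribute, and combine pattern described above can be illustrated
               on a single machine. The sketch below is not Hadoop code and uses no Hadoop API:
               worker threads stand in for processing nodes, and the function names and sample
               log lines are assumptions chosen for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Break the full data set into roughly equal sub-problems."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Work done independently on each sub-problem: count words in its lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Combine two partial results into one."""
    return left + right


def word_count(records, n_workers=3):
    chunks = split_into_chunks(records, n_workers)
    # Threads stand in for the separate processing nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return reduce(merge_counts, partials, Counter())


log_lines = ["GET /home", "GET /cart", "POST /cart", "GET /home"]
totals = word_count(log_lines)
```

               The combined result is a small table of counts, far smaller than the input
               lines, which mirrors the reduction to a data set that is easier to analyze.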
                  Hadoop consists of several key services: the Hadoop Distributed File System
               (HDFS) for data storage and MapReduce for high-performance parallel data
               processing. HDFS links together the file systems on the numerous nodes in a
               Hadoop cluster to turn them into one big file system. Hadoop’s MapReduce was
               inspired by Google’s MapReduce system for breaking down processing of huge
               datasets and assigning work to the various nodes in a cluster. HBase, Hadoop’s
               non-relational database, provides rapid access to the data stored on HDFS and a
               transactional platform for running high-scale real-time applications.
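               The idea of linking many nodes' storage into one big file system can be shown
               with a toy model. The class below is purely illustrative and is not the HDFS
               API: the class name, the round-robin placement, and the 8-byte block size are
               assumptions, and it omits features of the real system such as block replication
               and its much larger blocks.

```python
class ToyDistributedFS:
    """Toy model only: presents several nodes' storage as one file namespace."""

    def __init__(self, node_names, block_size=8):
        self.block_size = block_size
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.index = {}                                 # path -> [(node, block_id)]

    def write(self, path, data):
        """Split the data into blocks and spread them across the nodes."""
        names = list(self.nodes)
        placements = []
        for n, start in enumerate(range(0, len(data), self.block_size)):
            node = names[n % len(names)]  # round-robin placement (an assumption)
            block_id = f"{path}#{n}"
            self.nodes[node][block_id] = data[start:start + self.block_size]
            placements.append((node, block_id))
        self.index[path] = placements

    def read(self, path):
        """Reassemble the file from its blocks, wherever they are stored."""
        return b"".join(self.nodes[node][bid] for node, bid in self.index[path])


fs = ToyDistributedFS(["node1", "node2", "node3"])
payload = b"GET /home 200\nGET /cart 404\n"
fs.write("/logs/web.log", payload)
restored = fs.read("/logs/web.log")
```

               A caller reads and writes by path alone; which node holds which block is a
               detail kept in the index, which is the sense in which the nodes behave as a
               single file system.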
                  Hadoop can process large quantities of any kind of data, including structured
               transactional data, loosely structured data such as Facebook and Twitter feeds,
               complex data such as Web server log files, and unstructured audio and video
               data. Hadoop runs on a cluster of inexpensive servers, and processors can be
               added or removed as needed. Companies use Hadoop for analyzing very large






