Chapter 6 Foundations of Business Intelligence: Databases and Information Management 255
Data Warehouses and Data Marts
The traditional tool for analyzing corporate data for the past two decades has
been the data warehouse. A data warehouse is a database that stores current
and historical data of potential interest to decision makers throughout the
company. The data originate in many core operational transaction systems,
such as systems for sales, customer accounts, and manufacturing, and may
include data from Web site transactions. The data warehouse extracts current
and historical data from multiple operational systems inside the organization.
These data are combined with data from external sources and transformed
by correcting inaccurate and incomplete data and restructuring the data
for management reporting and analysis before being loaded into the data
warehouse.
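The extract, transform, and load steps described above can be sketched in a few lines of Python. This is a minimal illustration only; the system names, record fields, and cleaning rules are hypothetical, and real ETL pipelines use dedicated tooling rather than in-memory lists.

```python
# Hypothetical sketch of extract-transform-load (ETL) into a warehouse.
# System names and fields are illustrative, not from any real product.

def extract():
    """Pull raw records from two hypothetical operational systems."""
    sales_system = [{"cust": "Ada", "amount": "100.0"},
                    {"cust": "", "amount": "55.5"}]           # incomplete record
    web_system = [{"cust": "Bob", "amount": "not-a-number"}]  # inaccurate record
    return sales_system + web_system

def transform(records):
    """Correct inaccurate/incomplete data; restructure for reporting."""
    clean = []
    for r in records:
        if not r["cust"]:
            continue                      # drop incomplete records
        try:
            amount = float(r["amount"])
        except ValueError:
            continue                      # drop inaccurate records
        clean.append({"customer": r["cust"], "amount": amount})
    return clean

def load(records, warehouse):
    """Append the cleaned records to the warehouse store."""
    warehouse.extend(records)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # only the corrected, restructured record remains
```

Of the three raw records, only the complete and valid one survives transformation and is loaded for analysis.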
The data warehouse makes the data available for anyone to access as needed,
but the data cannot be altered. A data warehouse system also provides a range of
ad hoc and standardized query tools, analytical tools, and graphical reporting
facilities.
Companies often build enterprise-wide data warehouses, where a central
data warehouse serves the entire organization, or they create smaller, decentral-
ized warehouses called data marts. A data mart is a subset of a data warehouse
in which a summarized or highly focused portion of the organization’s data is
placed in a separate database for a specific population of users. For example,
a company might develop marketing and sales data marts to deal with cus-
tomer information. Bookseller Barnes & Noble used to maintain a series of data
marts—one for point-of-sale data in retail stores, another for college bookstore
sales, and a third for online sales.
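A data mart of the kind just described is, in essence, a focused copy of warehouse rows for one user population. The following sketch illustrates the idea with hypothetical sales-channel data loosely modeled on the Barnes & Noble example; the field names are invented for illustration.

```python
# Hypothetical illustration: a data mart as a focused subset of the
# warehouse, built for one specific population of users.

warehouse = [
    {"channel": "retail",  "title": "Book A", "units": 3},
    {"channel": "online",  "title": "Book B", "units": 1},
    {"channel": "college", "title": "Book C", "units": 2},
]

def build_mart(warehouse_rows, channel):
    """Copy only the rows one user population needs into its own store."""
    return [row for row in warehouse_rows if row["channel"] == channel]

online_mart = build_mart(warehouse, "online")
print(online_mart)  # only the online-sales rows
```

The online sales team queries its small mart rather than the full enterprise warehouse, which is the performance and focus advantage the text describes.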
Hadoop
Relational DBMS and data warehouse products are not well suited for organizing
and analyzing big data or data that do not easily fit into the columns and rows
used in their data models. For handling unstructured and semi-structured data
in vast quantities, as well as structured data, organizations are using Hadoop.
Hadoop is an open source software framework managed by the Apache
Software Foundation that enables distributed parallel processing of huge
amounts of data across inexpensive computers. It breaks a big data problem
down into sub-problems, distributes them among up to thousands of inexpensive
computer processing nodes, and then combines the results into a smaller
data set that is easier to analyze. You’ve probably used Hadoop to find the best
airfare on the Internet, get directions to a restaurant, do a search on Google, or
connect with a friend on Facebook.
Hadoop consists of several key services: the Hadoop Distributed File System
(HDFS) for data storage and MapReduce for high-performance parallel data
processing. HDFS links together the file systems on the numerous nodes in a
Hadoop cluster to turn them into one big file system. Hadoop’s MapReduce was
inspired by Google’s MapReduce system for breaking down processing of huge
datasets and assigning work to the various nodes in a cluster. HBase, Hadoop’s
non-relational database, provides rapid access to the data stored on HDFS and a
transactional platform for running high-scale real-time applications.
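The MapReduce pattern that Hadoop implements can be illustrated with a classic word count. The sketch below is a toy, single-machine simulation of the map, shuffle (group by key), and reduce phases, not the Hadoop API itself; in a real cluster these phases run in parallel across many HDFS nodes.

```python
# Toy single-machine simulation of the MapReduce pattern (word count).
# This is NOT the Hadoop API; it only mirrors the map/shuffle/reduce phases.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (key, value) pair for each word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In Hadoop, the map and reduce functions are supplied by the programmer, while the framework handles distributing the input, shuffling intermediate pairs between nodes, and collecting the reduced output.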
Hadoop can process large quantities of any kind of data, including structured
transactional data, loosely structured data such as Facebook and Twitter feeds,
complex data such as Web server log files, and unstructured audio and video
data. Hadoop runs on a cluster of inexpensive servers, and processors can be
added or removed as needed. Companies use Hadoop for analyzing very large