be stored in a single storage unit. Big Data typically refers to data that is arriving in
many different forms, be they structured, unstructured, or streaming. Major sources
of such data include clickstreams from Web sites, postings on social media sites such as
Facebook, and data from traffic, sensors, or weather. A Web search engine like Google
needs to search and index billions of Web pages in order to give you relevant search
results in a fraction of a second. Although the index is not built in real time, generating
an index of all the Web pages on the Internet is no small task. Google solved this
problem by employing, among other tools, Big Data analytical techniques.
There are two aspects to managing data on this scale: storing it and processing it.
Storing all the data in one place on a single, extremely expensive storage unit would still
leave a problem: making that one unit fault tolerant would involve major additional
expense. An ingenious solution was proposed instead: store the data in chunks on
different machines connected by a network, keeping a copy or two of each chunk in
different locations on the network, both logically and physically. This approach was first
used at Google as the Google File System (GFS) and was later developed and released as
an Apache project, the Hadoop Distributed File System (HDFS).
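To make the chunking-and-replication idea concrete, the following is a minimal, single-process sketch in Python. It illustrates only the concept: the block size, replication factor, and function names such as split_into_blocks are illustrative assumptions, not the actual HDFS implementation.

import itertools

DEMO_BLOCK_SIZE = 100   # real HDFS blocks are tens of megabytes (e.g., 64 MB)
REPLICATION_FACTOR = 3  # keep several copies of every block for fault tolerance

def split_into_blocks(data, block_size=DEMO_BLOCK_SIZE):
    # Cut the raw file into fixed-size chunks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replicas=REPLICATION_FACTOR):
    # Assign each block to `replicas` distinct nodes, round-robin style,
    # so that losing any single machine never loses a block.
    node_cycle = itertools.cycle(nodes)
    placement = {}
    for block_id in range(len(blocks)):
        targets = set()
        while len(targets) < min(replicas, len(nodes)):
            targets.add(next(node_cycle))
        placement[block_id] = sorted(targets)
    return placement

nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"x" * 250)   # a fake 250-byte "file" -> 3 blocks
print(place_blocks(blocks, nodes))
# e.g., {0: ['node-a', 'node-b', 'node-c'], 1: ['node-a', 'node-b', 'node-d'], ...}

Because every block lives on several machines, a failed node takes down only copies that exist elsewhere, which is how a network of cheap machines can match the fault tolerance of an expensive single unit.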
However, storing the data is only half the problem. Data is worthless if it does
not provide business value, and for it to provide business value, it has to be analyzed.
How are such vast amounts of data analyzed? Moving all the data to one powerful
computer for processing does not work; at this scale, the data transfer alone creates a
huge overhead. Another ingenious solution was proposed: push the computation to the
data, instead of pushing the data to a computing node. This new paradigm gave rise
to a whole new way of processing data, known today as the MapReduce programming
paradigm, which made processing Big Data a reality. MapReduce was originally
developed at Google, and a subsequent version was released by the Apache project as
Hadoop MapReduce.
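The canonical illustration of this paradigm is counting words. Below is a minimal, single-machine sketch in Python of the map, shuffle, and reduce phases; in real Hadoop MapReduce these phases run in parallel on the nodes that already hold the data blocks, and none of the names below belong to the Hadoop API.

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does automatically between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all the values for one key into a final count.
    return key, sum(values)

splits = ["big data needs big tools", "data about data"]   # two input splits
pairs = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
# {'big': 2, 'data': 3, 'needs': 1, 'tools': 1, 'about': 1}

Because each mapper can run on the machine that holds its input split, only the small intermediate (key, value) pairs travel over the network rather than the raw data, which is exactly the "push computation to the data" idea.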
Today, when we talk about storing, processing, or analyzing Big Data, HDFS and
MapReduce are involved at some level. Other relevant standards and software solutions
have been proposed. Although the core toolkit is available as open source, several
companies have been launched to provide training or specialized analytical hardware or
software services in this space. Some examples are Hortonworks, Cloudera, and Teradata
Aster.
Over the past few years, the meaning of Big Data has broadened as new Big Data
applications have appeared. The need to process data arriving at a rapid rate added
velocity to the equation. One example of fast data processing is algorithmic trading,
the use of algorithm-driven electronic platforms to trade shares on financial markets,
operating on the order of microseconds. The need to process many different kinds of
data added variety to the equation. An example of such variety is sentiment analysis,
which uses various forms of data from social media platforms and customer responses
to gauge sentiments. Today Big Data is associated with almost any kind of large data
that has the characteristics of volume, velocity, and variety.
Application Case 1.7 illustrates one example of Big Data analytics. We will study Big
Data characteristics in more detail in Chapters 3 and 13.
Section 1.9 Review Questions
1. What is Big Data analytics?
2. What are the sources of Big Data?
3. What are the characteristics of Big Data?
4. What processing technique is applied to process Big Data?