

                                    be stored in a single storage unit. Big Data typically refers to data that is arriving in
                                    many different forms, be they structured, unstructured, or in a stream. Major sources
of such data are clickstreams from Web sites, postings on social media sites such as Facebook, and traffic, sensor, and weather data. A Web search engine like Google
                                    needs to search and index billions of Web pages in order to give you relevant search
results in a fraction of a second. Although the index itself is not built in real time, generating an index of all the Web pages on the Internet is no easy task. Google was able to solve this problem by employing, among other tools, Big Data analytical techniques.
There are two aspects to managing data on this scale: storing it and processing it. Even if we could purchase an extremely expensive storage solution to keep all the data in one place on a single unit, making that unit fault tolerant would involve major additional expense. An ingenious solution was proposed instead: store the data in chunks on different machines connected by a network, and place a copy or two of each chunk at different locations on the network, both logically and physically. This approach was originally used at Google (as the Google File System) and was later developed and released as an Apache project, the Hadoop Distributed File System (HDFS).
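To make the idea concrete, the following toy Python sketch splits a file into fixed-size chunks and assigns each chunk to several machines. It is only an illustration of the chunk-and-replicate idea, not the real HDFS API; the block size, replication factor, and node names are assumptions chosen for readability (real HDFS blocks are far larger, typically 128 MB).

```python
import itertools

# Toy illustration of HDFS-style chunking and replication (not the real HDFS API).
BLOCK_SIZE = 16          # bytes per chunk; illustrative only
REPLICATION_FACTOR = 3   # copies of each chunk kept on different machines
NODES = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical data nodes

def store_file(data: bytes):
    """Split data into fixed-size chunks and assign each chunk to several nodes."""
    placement = {}
    node_cycle = itertools.cycle(range(len(NODES)))
    for block_id, start in enumerate(range(0, len(data), BLOCK_SIZE)):
        chunk = data[start:start + BLOCK_SIZE]
        first = next(node_cycle)
        replicas = [NODES[(first + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]
        placement[block_id] = {"bytes": chunk, "replicas": replicas}
    return placement

for block_id, info in store_file(b"a small file standing in for web-scale data").items():
    print(block_id, info["replicas"], info["bytes"])
```

Because every chunk exists on several machines, the loss of any single machine loses no data, and a read can be served from whichever replica is most convenient.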
However, storing this data is only half the problem. Data is worthless if it does not provide business value, and for it to provide business value, it has to be analyzed. How are such vast amounts of data analyzed? Moving all the data to one powerful computer for processing does not work; at this scale, it would create a huge overhead even on such a machine. Another ingenious solution was proposed: push the computation to the data, instead of pushing the data to a computing node. This was a new paradigm, and it gave rise to a whole new way of processing data. It is what we know today as the MapReduce programming paradigm, which made processing Big Data a reality. MapReduce was originally developed at Google, and a subsequent version, Hadoop MapReduce, was released by the Apache project.
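The paradigm is easiest to see in the classic word-count example. The following Python sketch simulates the map and reduce phases in a single process; the function names and sample documents are illustrative assumptions, not the Hadoop MapReduce API.

```python
from collections import defaultdict

# Illustrative documents; in Hadoop, these would be blocks of a file stored in HDFS.
documents = [
    "big data needs big storage",
    "big data needs distributed processing",
]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in doc.split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# The shuffle step is implicit here: all intermediate pairs are simply gathered together.
all_pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(all_pairs))
# {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, 'distributed': 1, 'processing': 1}
```

In a real Hadoop cluster, the map tasks run on the nodes that already hold the data blocks in HDFS, so only the much smaller intermediate (word, count) pairs travel across the network.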
                                         Today, when we talk about storing, processing, or analyzing Big Data, HDFS and
                                    MapReduce are involved at some level. Other relevant standards and software solutions
                                    have been proposed. Although the major toolkit is available as open source, several
                                    companies have been launched to provide training or specialized analytical hardware or
software services in this space. Some examples are Hortonworks, Cloudera, and Teradata
                                    Aster.
Over the past few years, what is called Big Data has evolved as Big Data applications have appeared. The need to process data coming in at a rapid rate added velocity to the equation. One example of fast data processing is algorithmic trading, the use of algorithm-driven electronic platforms for trading shares on financial markets, which operate on the order of microseconds. The need to process many different kinds of data added variety to the equation. An example of such variety is sentiment analysis, which uses various forms of data from social media platforms and customer responses to gauge sentiment. Today, Big Data is associated with almost any kind of large data set that has the characteristics of volume, velocity, and variety.
                                    Application Case 1.7 illustrates one example of Big Data analytics. We will study Big
                                    Data characteristics in more detail in Chapters 3 and 13.



Section 1.9 Review Questions
                                      1. What is Big Data analytics?
                                      2. What are the sources of Big Data?
                                      3. What are the characteristics of Big Data?
  4. What processing technique is applied to process Big Data?







