Page 31 - Building Big Data Applications

Chapter 2   Infrastructure and technology  25


                 failure quickly. When corruption is detected, a combination of frequent
                 checkpoints, snapshots, and replicas allows the data to be recovered with minimal
                 chance of data loss. The architecture may leave data unavailable for a short period,
                 but it does not leave data corrupted.
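
                 The recovery idea above can be sketched in a few lines. This is an illustrative
                 sketch only, not GFS code: it assumes each chunk has a known-good checksum and
                 several in-memory replicas, and the names `checksum` and `read_with_recovery` are
                 hypothetical.

```python
import zlib
from typing import List


def checksum(data: bytes) -> int:
    """CRC32 stands in for whatever checksum the storage layer uses (assumption)."""
    return zlib.crc32(data)


def read_with_recovery(replicas: List[bytearray], expected: int) -> bytes:
    """Return a replica whose checksum matches and repair the corrupt copies."""
    good = next((bytes(r) for r in replicas if checksum(bytes(r)) == expected), None)
    if good is None:
        # No healthy replica left: fall back to a checkpoint or snapshot.
        raise IOError("all replicas are corrupt")
    for replica in replicas:
        if checksum(bytes(replica)) != expected:
            replica[:] = good  # overwrite the corrupt copy in place
    return good


original = b"chunk-0001 payload"
expected = checksum(original)
replicas = [bytearray(original) for _ in range(3)]
replicas[0][0] ^= 0xFF  # simulate corruption of one replica

data = read_with_recovery(replicas, expected)
```

                 While the repair runs, the chunk may be briefly unavailable from the damaged
                 replica, but the reader never receives corrupt data.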
                   The GFS architecture has the following strengths:
                   Availability
                     Triple replication-based redundancy (or more if you choose)
                     Chunk replication
                     Rapid failovers for any master failure
                     Automatic replication management
                   Performance
                     The biggest workload for GFS is reads on large data sets, which, based on the
                      architecture discussion, will be a nonissue.
                     Few writes go directly to the chunks, which helps keep data available
                   Management
                     GFS manages itself through multiple failure modes
                     Automatic load balancing
                     Storage management and pooling
                     Chunk management
                     Failover management
                   Cost
                     Not a constraint, owing to the use of commodity hardware and Linux platforms
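
                 To make "automatic replication management" concrete, here is a minimal sketch of
                 how a master might restore triple replication after a server failure. It is an
                 assumption-laden illustration, not the GFS algorithm: the `placement` map, the
                 server names, and the least-loaded placement policy are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

REPLICATION_TARGET = 3  # triple replication, as listed under Availability


def rereplicate(placement: Dict[str, Set[str]],
                live_servers: Iterable[str],
                target: int = REPLICATION_TARGET) -> Dict[str, Set[str]]:
    """Drop dead servers from each chunk's replica set, then add replicas on
    the least-loaded live servers until every chunk is back at `target`."""
    live = set(live_servers)
    load = defaultdict(int)

    # Pass 1: forget replicas on failed servers and count load per live server.
    for servers in placement.values():
        servers &= live
        for server in servers:
            load[server] += 1

    # Pass 2: top up under-replicated chunks.
    for servers in placement.values():
        if not servers:
            continue  # no surviving copy to clone; needs a snapshot restore
        candidates = sorted(live - servers, key=lambda s: (load[s], s))
        while len(servers) < target and candidates:
            chosen = candidates.pop(0)
            servers.add(chosen)
            load[chosen] += 1
    return placement


placement = {"c1": {"s1", "s2", "s3"}, "c2": {"s2", "s3", "s4"}}
rereplicate(placement, live_servers=["s1", "s3", "s4", "s5"])  # s2 has failed
```

                 Choosing the least-loaded server doubles as a crude form of the load balancing
                 listed under Management.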

                   These platforms, combined with proprietary techniques, enabled Google, and other
                 companies that adopted the technologies and customized them further, to achieve the
                 performance they needed within their organizations.
                   Pure-play deployments of MapReduce + GFS (or another similar filesystem) can
                 become messy to manage in large environments. Google has created multiple
                 proprietary layers that other organizations cannot adapt. For manageability and
                 ease of deployment, the most extensible and successful platform for MapReduce is
                 Hadoop, which we will discuss in later sections of this chapter. There are many
                 variants of MapReduce programming today, including SQL-MapReduce (Aster Data),
                 Greenplum MapReduce, MapReduce with Ruby, and MongoDB MapReduce, to name a few.
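
                 Whatever the variant, the programming model has the same shape. The following is
                 a minimal single-process sketch of that shape in plain Python, using word count
                 as the example; the function names are hypothetical, and a real framework would
                 distribute each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data on Hadoop", "big clusters process big data"])
```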


                 Hadoop
                 The most popular word in the industry at the time of writing this book, Hadoop has
                 taken the world by storm by providing a solution architecture for big data
                 processing on a cheaper commodity platform, with faster scalability and parallel
                 processing. This section introduces you to Hadoop and covers its core components.