Page 31 - Building Big Data Applications
Chapter 2 Infrastructure and technology
failure quickly. When corruption is detected, a combination of frequent checkpoints,
snapshots, and replicas allows the data to be recovered with minimal chance of loss.
The architecture may result in data being unavailable for a short period, but not in
data corruption.
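As a sketch of this recovery path, the following hypothetical Python example (the names and structure are invented for illustration and are not GFS's actual code) re-creates a corrupted chunk replica from a healthy one, with corruption detected by comparing checksums:

```python
# Hypothetical sketch of replica-based chunk recovery (not GFS's actual code).
# Each replica records a checksum at write time; a mismatch later signals
# silent corruption, which is repaired by copying from a healthy replica.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ChunkStore:
    """One storage node holding chunk data plus recorded checksums."""
    def __init__(self):
        self.chunks = {}     # chunk_id -> bytes
        self.checksums = {}  # chunk_id -> checksum recorded at write time

    def put(self, chunk_id, data: bytes):
        self.chunks[chunk_id] = data
        self.checksums[chunk_id] = checksum(data)

    def is_corrupt(self, chunk_id) -> bool:
        return checksum(self.chunks[chunk_id]) != self.checksums[chunk_id]

def recover(chunk_id, replicas):
    """Repair corrupted replicas from any healthy one; report success."""
    healthy = [r for r in replicas if not r.is_corrupt(chunk_id)]
    if not healthy:
        return False  # data loss: no intact replica remains
    good = healthy[0].chunks[chunk_id]
    for r in replicas:
        if r.is_corrupt(chunk_id):
            r.put(chunk_id, good)
    return True

# Usage: three replicas, one silently corrupted, then repaired.
replicas = [ChunkStore() for _ in range(3)]
for r in replicas:
    r.put("c1", b"payload")
replicas[1].chunks["c1"] = b"bitrot"           # simulate silent corruption
print(recover("c1", replicas))                 # True
print(replicas[1].chunks["c1"] == b"payload")  # True
```

With triple replication, all three copies would have to be corrupted or lost at once before `recover` fails, which is why the window of unavailability rarely becomes actual data loss.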
The GFS architecture has the following strengths:

Availability
- Triple replication-based redundancy (or more, if you choose)
- Chunk replication
- Rapid failover on any master failure
- Automatic replication management

Performance
- The biggest workload for GFS is reads on large data sets, which, as the architecture discussion shows, it handles well.
- There are minimal direct writes to the chunks, which helps keep the data highly available.

Management
- GFS manages itself through multiple failure modes
- Automatic load balancing
- Storage management and pooling
- Chunk management
- Failover management

Cost
- Cost is not a constraint, owing to the use of commodity hardware and Linux platforms
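The replication-management and load-balancing points above can be sketched in a few lines. This is an illustrative model, not Google's implementation: a master tracks which nodes hold each chunk, places replicas on the least-loaded nodes, and re-replicates automatically when a node fails.

```python
# Illustrative sketch of automatic replication management (not GFS's code).
# The master records chunk placement, prefers least-loaded nodes, and
# restores the replication factor after a node failure.

from collections import defaultdict

REPLICATION_FACTOR = 3

class Master:
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.placement = defaultdict(set)  # chunk_id -> set of node names

    def load(self, node):
        """Number of chunk replicas currently stored on a node."""
        return sum(1 for reps in self.placement.values() if node in reps)

    def place(self, chunk_id):
        """Top up this chunk's replicas using the least-loaded nodes."""
        candidates = sorted(self.nodes - self.placement[chunk_id], key=self.load)
        needed = REPLICATION_FACTOR - len(self.placement[chunk_id])
        for node in candidates[:needed]:
            self.placement[chunk_id].add(node)

    def node_failed(self, node):
        """Drop the node, then re-replicate every under-replicated chunk."""
        self.nodes.discard(node)
        for reps in self.placement.values():
            reps.discard(node)
        for chunk_id in self.placement:
            self.place(chunk_id)

# Usage: place two chunks across four nodes, then lose a node.
m = Master(["n1", "n2", "n3", "n4"])
for c in ("c1", "c2"):
    m.place(c)
m.node_failed("n1")
print(all(len(r) == 3 for r in m.placement.values()))  # True
```

The real system must also handle rack awareness, re-replication priorities, and master state recovery, but the core loop, detect under-replication and copy to a lightly loaded node, is what "automatic replication management" refers to.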
Combined with proprietary techniques, these platforms enabled Google, and the other
companies that adopted and further customized the technologies, to achieve the
performance their organizations required.
A pure-play architecture of MapReduce + GFS (or a similar file system) can become
messy to manage in large environments. Google has built multiple proprietary layers
on top that other organizations cannot adopt. To make management and deployment
tractable, the most extensible and successful platform for MapReduce is Hadoop, which
we will discuss in later sections of this chapter. There are many variants of MapReduce
programming today, including SQL-MapReduce (AsterData), Greenplum MapReduce,
MapReduce with Ruby, and MongoDB MapReduce, to name a few.
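Whatever the dialect, the programming model underneath is the same: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A minimal word-count sketch in plain Python (illustrating the model, not any vendor's API):

```python
# Word count in the MapReduce style, run sequentially for illustration.
# In a real cluster, map tasks and reduce tasks run in parallel on
# different machines, with the shuffle moving data between them.

from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in doc.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return key, sum(values)

# Usage: count words across three "documents".
docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

Every variant named above, and Hadoop itself, supplies the shuffle, scheduling, and fault tolerance around these two user-written functions; that is what makes the model easy to scale.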
Hadoop
The most popular word in the industry at the time of writing this book, Hadoop has
taken the world by storm by providing a solution architecture for big data processing
on cheaper commodity platforms, with faster scalability and parallel processing. This
section's goal is to introduce you to Hadoop and cover its core components.