Page 29 - Building Big Data Applications
Chapter 2 Infrastructure and technology 23
Reduce (out_key, intermediate_value list) -> out_value list
The Reduce function written by the user will accept an intermediate key I, and
the set of values for the key.
It will merge together these values to form a possibly smaller set of values.
Typically, each Reduce invocation produces just zero or one output value.
The intermediate values are supplied to the reduce function via an iterator. The
iterator allows us to handle lists of values that are too large to fit in memory
or to be processed in a single pass.
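The Reduce contract described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the names `reduce_word_count` and `run_reduce`, the word-count example, and the in-memory sort used to group keys are all assumptions made for the sketch. The point it shows is that the reducer receives a key plus a lazy iterator of values, and never needs the whole value list at once.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Iterable, Iterator, Tuple


def reduce_word_count(key: str, values: Iterator[int]) -> Iterator[int]:
    """Hypothetical user-defined Reduce: sums all counts for one key."""
    total = 0
    for value in values:  # values are pulled one at a time from the iterator
        total += value
    yield total  # zero or one output value per invocation; here exactly one


def run_reduce(
    pairs: Iterable[Tuple[str, int]],
    reducer: Callable[[str, Iterator[int]], Iterator[int]],
) -> Iterator[Tuple[str, int]]:
    """Hypothetical driver: groups intermediate pairs by key, then reduces."""
    # Sorting makes equal keys adjacent, which groupby requires.
    # (A real system would sort and merge on disk; this sketch sorts in memory.)
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = (value for _, value in group)  # lazy iterator over the values
        for out_value in reducer(key, values):
            yield key, out_value


intermediate = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
result = list(run_reduce(intermediate, reduce_word_count))
print(result)
```

Because `reduce_word_count` only keeps a running total, its memory use stays constant no matter how many values arrive for a key.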
MapReduce Google architecture
In the original architecture that Google proposed and implemented, MapReduce
consisted of the components shown in Fig. 2.3. The key pieces of the
architecture include the following:
A GFS cluster
    A single master
    Multiple chunkservers (workers or slaves) per master
    Accessed by multiple clients
    Running on commodity Linux machines
A file
    Represented as fixed-sized chunks
    Labeled with 64-bit unique global IDs
    Stored at chunkservers and 3-way mirrored across chunkservers
In the GFS cluster, input data files are divided into chunks (64 MB is the standard
chunk size), each assigned its unique 64-bit handle, and stored on local chunkserver
systems as files. To ensure fault tolerance and scalability, each chunk is replicated at
least once on another server, and the default design is to create three copies of a chunk
(Fig. 2.4).
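The chunking and replication scheme described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the GFS master's actual logic: the function `plan_chunks`, the counter-based handle allocation, the round-robin replica placement, and the server names `cs1` to `cs4` are all hypothetical. Only the 64 MB chunk size, the 64-bit handle, and the three copies per chunk come from the text.

```python
import itertools
from typing import Dict, List

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB standard chunk size (from the text)
REPLICAS = 3                   # default number of copies per chunk (from the text)

# Assumption: handles come from a simple counter, masked to 64 bits.
_handle_counter = itertools.count(1)


def plan_chunks(
    file_size: int,
    servers: List[str],
    chunk_size: int = CHUNK_SIZE,
    replicas: int = REPLICAS,
) -> List[Dict]:
    """Hypothetical planner: splits a file into chunks and places replicas."""
    if replicas > len(servers):
        raise ValueError("need at least as many chunkservers as replicas")
    # Ceiling division: the last chunk may be only partly filled.
    n_chunks = (file_size + chunk_size - 1) // chunk_size
    plan = []
    for index in range(n_chunks):
        handle = next(_handle_counter) & 0xFFFFFFFFFFFFFFFF  # fits in 64 bits
        # Assumption: round-robin placement, so the copies of one chunk
        # always land on distinct chunkservers.
        placement = [servers[(index + r) % len(servers)] for r in range(replicas)]
        plan.append({"index": index, "handle": handle, "servers": placement})
    return plan


# A 200 MB file needs four chunks: three full ones and one partial.
plan = plan_chunks(200 * 1024 * 1024, ["cs1", "cs2", "cs3", "cs4"])
for chunk in plan:
    print(chunk["index"], hex(chunk["handle"]), chunk["servers"])
```

Placing each copy on a different chunkserver is what lets a chunk survive the loss of a single machine, which is the fault-tolerance goal the paragraph describes.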
FIGURE 2.3 Client-server architecture.