Page 30 - Building Big Data Applications

                            FIGURE 2.4 Google MapReduce cluster. Image source: Google briefing.


                If there is only one master, is it a potential bottleneck in the architecture? The
             role of the master is to tell clients which chunkservers hold which chunks, along with
             the metadata for those chunks. Client tasks then interact directly with chunkservers
             for all subsequent operations and use the master only minimally. The master therefore
             stays off the data path and is not in a position to become the bottleneck.
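
The interaction described above can be sketched roughly as follows. This is a minimal illustration, not GFS code: the class names, the `lookups` counter, the client-side cache, and the tiny 8-byte `CHUNK_SIZE` are all assumptions made for the example, and reads are assumed to stay within a single chunk.

```python
CHUNK_SIZE = 8  # hypothetical, tiny on purpose so the demo is readable


class Chunkserver:
    """Holds chunk data; clients read from it directly."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


class Master:
    """Answers metadata questions only; it never serves file data."""

    def __init__(self):
        self.file_to_chunks = {}   # filename -> list of chunk handles
        self.chunk_locations = {}  # chunk handle -> list of Chunkserver
        self.lookups = 0           # counts how often clients ask the master

    def locate(self, filename, chunk_index):
        self.lookups += 1
        handle = self.file_to_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]


class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk index) -> (handle, servers)

    def read(self, filename, offset, length):
        index = offset // CHUNK_SIZE
        key = (filename, index)
        if key not in self.cache:
            # The only point where the master is involved.
            self.cache[key] = self.master.locate(filename, index)
        handle, servers = self.cache[key]
        # Data flows between client and chunkserver, not through the master.
        return servers[0].read(handle, offset % CHUNK_SIZE, length)


cs = Chunkserver("cs-1")
cs.chunks["h0"] = b"abcdefgh"
master = Master()
master.file_to_chunks["/logs/a"] = ["h0"]
master.chunk_locations["h0"] = [cs]

client = Client(master)
first = client.read("/logs/a", 0, 4)
second = client.read("/logs/a", 4, 4)
```

Both reads touch the same chunk, so the master is asked only once; the second read is served entirely from the cached location and the chunkserver.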
                Another important issue to understand in the GFS architecture is that the master
             node, which holds all the metadata that keeps track of the chunks and their state, is
             a single point of failure (SPOF). To mitigate this, GFS was designed to have the master
             keep metadata in memory for speed, keep a log on the master’s local disk, and replicate
             that log across remote nodes. This way, if the master node crashes, a shadow can be up
             and running almost instantly.
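
A minimal sketch of this recovery scheme, assuming a JSON-lines log format and hypothetical names (`OperationLog`, `MasterState`, `mutate`, `recover`) chosen for the example: every metadata change is logged locally and on remote copies before it is applied in memory, so a shadow can rebuild the same state by replaying a replicated log.

```python
import json


class OperationLog:
    """Append-only log; plain lists stand in for local and remote disks."""

    def __init__(self, remote_count=2):
        self.local = []
        self.remotes = [[] for _ in range(remote_count)]

    def append(self, record):
        line = json.dumps(record)
        self.local.append(line)
        for remote in self.remotes:
            remote.append(line)  # replicate each record to remote nodes


class MasterState:
    """In-memory metadata, kept in memory for speed."""

    def __init__(self):
        self.file_to_chunks = {}

    def apply(self, record):
        if record["op"] == "create":
            self.file_to_chunks[record["file"]] = []
        elif record["op"] == "add_chunk":
            self.file_to_chunks[record["file"]].append(record["chunk"])


def mutate(state, log, record):
    log.append(record)   # make the change durable first
    state.apply(record)  # then update the in-memory view


def recover(log_lines):
    """Rebuild master state on a shadow by replaying a log copy."""
    state = MasterState()
    for line in log_lines:
        state.apply(json.loads(line))
    return state


primary = MasterState()
log = OperationLog()
mutate(primary, log, {"op": "create", "file": "/data/a"})
mutate(primary, log, {"op": "add_chunk", "file": "/data/a", "chunk": "c1"})
mutate(primary, log, {"op": "add_chunk", "file": "/data/a", "chunk": "c2"})

# Simulate a master crash: a shadow replays one of the remote log copies.
shadow = recover(log.remotes[0])
```

Writing to the log before touching the in-memory state is the design choice that matters here: anything the primary ever exposed is guaranteed to be in the log a shadow replays.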
                The master stores three types of metadata:
               File and chunk names or namespaces
               Mapping from files to chunks, i.e., the chunks that make up each file
               Locations of each chunk’s replicas: the replica locations for each chunk are stored
                on the local chunkserver in addition to being replicated, and the replica
                information is provided to the master at startup or when a chunkserver is added to
                a cluster. Since the master controls chunk placement, it always updates the
                metadata as new chunks are written.
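
The three kinds of metadata listed above can be pictured as three in-memory maps. The sketch below is illustrative only; the class `MasterMetadata` and its method names are assumptions for the example, not names from GFS.

```python
from dataclasses import dataclass, field


@dataclass
class MasterMetadata:
    namespace: set = field(default_factory=set)          # file names
    file_to_chunks: dict = field(default_factory=dict)   # file -> chunk handles
    chunk_locations: dict = field(default_factory=dict)  # handle -> servers

    def create_file(self, path):
        self.namespace.add(path)
        self.file_to_chunks[path] = []

    def add_chunk(self, path, handle, servers):
        # The master chose the placement, so it records it right away.
        self.file_to_chunks[path].append(handle)
        self.chunk_locations[handle] = set(servers)

    def register_chunkserver(self, server, handles):
        # Called at startup or when a chunkserver joins the cluster:
        # the server reports which chunk replicas it holds.
        for handle in handles:
            self.chunk_locations.setdefault(handle, set()).add(server)

    def replicas_for(self, path):
        return [sorted(self.chunk_locations.get(h, ()))
                for h in self.file_to_chunks[path]]


meta = MasterMetadata()
meta.create_file("/data/a")
meta.add_chunk("/data/a", "c1", ["cs-1", "cs-2"])
meta.register_chunkserver("cs-3", ["c1"])  # a new server reports a replica
replicas = meta.replicas_for("/data/a")
```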

                The master keeps track of the health of the entire cluster through handshaking with
             all the chunkservers. Periodic checksums are computed to detect any data corruption.
             Given the volume and scale of processing, there is always a chance of data becoming
             corrupted or stale.
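
As a rough illustration of checksum-based corruption detection, the sketch below stores a CRC-32 with each chunk and rescans on demand. The use of `zlib.crc32`, one checksum per whole chunk, and the `ChunkStore` class are assumptions made for the example rather than details taken from the text.

```python
import zlib


class ChunkStore:
    """Keeps a checksum next to every chunk so damage can be detected."""

    def __init__(self):
        self.data = {}       # chunk handle -> bytes
        self.checksums = {}  # chunk handle -> CRC-32 recorded at write time

    def write(self, handle, payload):
        self.data[handle] = payload
        self.checksums[handle] = zlib.crc32(payload)

    def scan(self):
        """Return handles whose current bytes no longer match the checksum."""
        return [handle for handle, payload in self.data.items()
                if zlib.crc32(payload) != self.checksums[handle]]


store = ChunkStore()
store.write("c1", b"hello")
store.write("c2", b"world")

store.data["c2"] = b"w0rld"  # simulate silent corruption on disk
corrupted = store.scan()
```

A chunk flagged by such a scan would then be restored from one of its healthy replicas.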
                To recover from any corruption, GFS appends data as it becomes available rather than
             updating the existing dataset, which provides the ability to recover from corruption or