             during the next startup, on a restart, or on demand when requested by the
             administrator or by the CheckpointNode (described later in this chapter).


             HDFS startup

             Since the namespace image is held in memory, at every startup the
             NameNode initializes the namespace image from the checkpoint file and
             replays the changes recorded in the journal. Once the startup sequence
             completes, a new checkpoint and an empty journal are written back to the
             storage directories, and the NameNode starts serving client requests.
             For improved redundancy and reliability, copies of the checkpoint and
             journal can be kept on other servers.
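
             As a minimal sketch of that redundancy, the standard hdfs-site.xml
             property dfs.namenode.name.dir accepts a comma-separated list of
             directories, and the NameNode persists its checkpoint (fsimage) and
             journal (edit log) to each of them. The paths below are placeholders;
             the second is assumed to be an NFS mount served from another machine.

                 <property>
                   <name>dfs.namenode.name.dir</name>
                   <!-- Placeholder paths; the second directory is assumed to be
                        an NFS mount backed by a different server. -->
                   <value>/data/hdfs/name,/mnt/remote-nfs/hdfs/name</value>
                 </property>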

             Block allocation and storage


             Data organization in HDFS is managed much like in GFS. The namespace is
             represented by inodes, which represent files and directories and record
             attributes such as permissions, modification and access times, and
             namespace and disk space quotas. Files are split into user-defined block
             sizes (the default is 128 MB) and stored on DataNodes, with at least two
             additional replicas of each block to ensure availability and redundancy,
             though the user can configure more replicas. The storage locations of
             block replicas may change over time and hence are not part of the
             persistent checkpoint.
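
             As a sketch of per-file control over these settings, the standard Hadoop
             FileSystem.create overload below takes the replication factor and block
             size explicitly. The path, buffer size, and sample record are
             illustrative assumptions, not taken from the text.

                 import org.apache.hadoop.conf.Configuration;
                 import org.apache.hadoop.fs.FSDataOutputStream;
                 import org.apache.hadoop.fs.FileSystem;
                 import org.apache.hadoop.fs.Path;

                 public class BlockSettingsSketch {
                     public static void main(String[] args) throws Exception {
                         FileSystem fs = FileSystem.get(new Configuration());

                         Path file = new Path("/data/raw/clicks.log"); // hypothetical path
                         short replication = 3;               // three replicas in total
                         long blockSize = 128L * 1024 * 1024; // 128 MB, the default

                         // create(path, overwrite, bufferSize, replication, blockSize)
                         try (FSDataOutputStream out =
                                 fs.create(file, true, 4096, replication, blockSize)) {
                             out.writeBytes("first record\n");
                         }
                     }
                 }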


             HDFS client

             The client is a thin interface layer that programs use to access data
             stored within HDFS. The client first contacts the NameNode to get the
             locations of the data blocks that comprise the file. Once the block
             locations are returned, the client reads the block contents from the
             DataNode closest to it.
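
             A minimal read sketch using the standard FileSystem API follows; the
             file path is a placeholder. The call to open() consults the NameNode for
             block locations, and read() then streams bytes from the nearest
             DataNode.

                 import java.nio.charset.StandardCharsets;
                 import org.apache.hadoop.conf.Configuration;
                 import org.apache.hadoop.fs.FSDataInputStream;
                 import org.apache.hadoop.fs.FileSystem;
                 import org.apache.hadoop.fs.Path;

                 public class HdfsReadSketch {
                     public static void main(String[] args) throws Exception {
                         FileSystem fs = FileSystem.get(new Configuration());
                         // Hypothetical path.
                         try (FSDataInputStream in =
                                 fs.open(new Path("/data/ingest/events.log"))) {
                             byte[] buffer = new byte[8192];
                             int n;
                             while ((n = in.read(buffer)) > 0) {
                                 System.out.print(
                                     new String(buffer, 0, n, StandardCharsets.UTF_8));
                             }
                         }
                     }
                 }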
                When writing data, the client first asks the NameNode to provide
             DataNodes where the data can be written. The NameNode returns a block to
             write the data to. When the first block is filled, additional blocks are
             provided by the NameNode in a pipeline. The block allocated for each
             request might not be on the same DataNode.
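
             The corresponding write sketch is below; again the path and sample data
             are assumptions. The output stream hides the pipeline described above:
             the NameNode hands out the first block, and as each block fills, the
             client transparently requests the next one.

                 import org.apache.hadoop.conf.Configuration;
                 import org.apache.hadoop.fs.FSDataOutputStream;
                 import org.apache.hadoop.fs.FileSystem;
                 import org.apache.hadoop.fs.Path;

                 public class HdfsWriteSketch {
                     public static void main(String[] args) throws Exception {
                         FileSystem fs = FileSystem.get(new Configuration());
                         // Hypothetical destination path.
                         Path dest = new Path("/data/ingest/events.log");
                         // Roughly 300 MB of records, so the NameNode must allocate
                         // several 128 MB blocks as the stream fills each one.
                         try (FSDataOutputStream out = fs.create(dest)) {
                             for (int i = 0; i < 20_000_000; i++) {
                                 out.writeBytes("event-" + i + "\n");
                             }
                         }
                     }
                 }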
                One of the biggest design differentiators of HDFS is the API that
             exposes the locations of a file's blocks. This allows applications such
             as MapReduce to schedule a task where the data is located, thus
             improving I/O performance. The API also includes functionality to set
             the replication factor for each file. To maintain file and block
             integrity, once a block is assigned to a DataNode, two files are created
             to represent each replica in the local host's native filesystem. The
             first file contains the data itself, and the second contains the block's
             metadata, including checksums for the data and the generation stamp.
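
             To make those two APIs concrete, here is a brief sketch using the
             standard FileSystem calls getFileBlockLocations and setReplication. The
             file path and the new replication factor of 4 are illustrative
             assumptions.

                 import org.apache.hadoop.conf.Configuration;
                 import org.apache.hadoop.fs.BlockLocation;
                 import org.apache.hadoop.fs.FileStatus;
                 import org.apache.hadoop.fs.FileSystem;
                 import org.apache.hadoop.fs.Path;

                 public class BlockLocationSketch {
                     public static void main(String[] args) throws Exception {
                         FileSystem fs = FileSystem.get(new Configuration());
                         Path file = new Path("/data/ingest/events.log"); // hypothetical

                         // Ask the NameNode where each block of the file lives, so a
                         // scheduler could place tasks next to the data.
                         FileStatus status = fs.getFileStatus(file);
                         BlockLocation[] blocks =
                             fs.getFileBlockLocations(status, 0, status.getLen());
                         for (BlockLocation block : blocks) {
                             System.out.println("offset " + block.getOffset()
                                 + " -> hosts " + String.join(",", block.getHosts()));
                         }

                         // Raise the replication factor for this one file.
                         fs.setReplication(file, (short) 4);
                     }
                 }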