operations such as opening, closing, moving, naming, and renaming files and directories. It also manages the mapping of blocks to DataNodes.
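From a client's point of view, these namespace operations are issued through Hadoop's FileSystem API; the NameNode handles the metadata side of each call. The following is a minimal sketch, in which the NameNode URI and the paths are illustrative placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is an assumed address; use the cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Each call below is a namespace (metadata) operation served by the NameNode.
        fs.mkdirs(new Path("/data/incoming"));                           // create a directory
        fs.rename(new Path("/data/incoming"), new Path("/data/staged")); // move/rename
        fs.delete(new Path("/data/staged"), true);                       // recursive delete

        fs.close();
    }
}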
DataNode
DataNodes are the slaves in the architecture; each manages the data and the storage attached to it. A typical HDFS cluster can have thousands of DataNodes and tens of thousands of HDFS clients, since each DataNode may execute multiple application tasks simultaneously. The DataNodes are responsible for serving read and write requests from the filesystem’s clients and for block maintenance and replication as directed by the NameNode. Block management in HDFS differs from a normal filesystem: the data file that backs a block on the local drive is only as long as the data the block actually holds. If a block is half full, it consumes only half a block’s worth of local storage; unlike a regular filesystem, no extra space is wasted rounding the block up to its nominal size.
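The difference between the configured block size and the space a file actually consumes can be seen in the metadata that the client API reports. The sketch below is illustrative; the file path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockUsage {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical path; substitute any existing HDFS file.
        FileStatus status = fs.getFileStatus(new Path("/data/example.log"));

        long blockSize = status.getBlockSize(); // nominal block size, e.g., 128 MB
        long length    = status.getLen();       // actual bytes stored in the file

        // A file shorter than one block still occupies only 'length' bytes
        // (times the replication factor) on the DataNodes' local drives.
        System.out.printf("block size = %d bytes, file length = %d bytes%n",
                blockSize, length);
        fs.close();
    }
}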
A filesystem-based architecture needs to manage consistency, recoverability, and concurrency to operate reliably. HDFS meets these requirements by maintaining image, journal, and checkpoint files.
Image
An image represents the metadata of the namespace (inodes and lists of blocks). On startup, the NameNode pins the entire namespace image in memory. Keeping the image in memory enables the NameNode to service multiple client requests concurrently.
Journal
The journal is the modification log of the image, stored in the local host’s native filesystem. During normal operation, each client transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client. On startup or during recovery, the NameNode can replay this journal.
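The flush-and-sync discipline described above can be sketched in plain Java: a transaction record is appended to the journal file and forced to stable storage before the caller is acknowledged. This is a generic illustration of the pattern, not the NameNode’s actual implementation:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SimpleJournal implements AutoCloseable {
    private final FileOutputStream out;

    public SimpleJournal(String path) throws IOException {
        this.out = new FileOutputStream(path, true); // open the journal in append mode
    }

    // Append a transaction record, then flush and sync so the edit is durable
    // before the caller is acknowledged.
    public void logTransaction(String record) throws IOException {
        out.write((record + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
        out.flush();
        out.getFD().sync(); // force the bytes to stable storage
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}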
Checkpoint
To enable recovery, a persistent record of the image, called a checkpoint, is also stored in the local host’s native filesystem. Once the system starts up, the NameNode never modifies or updates the checkpoint file. A new checkpoint file can be created