Page 35 - Building Big Data Applications

Chapter 2 Infrastructure and technology

operations like opening, closing, moving, naming, and renaming of files and directories. It
also manages the mapping of blocks to DataNodes.

DataNode

DataNodes are the slaves in the architecture; each one manages data and the storage
attached to it. A typical HDFS cluster can have thousands of DataNodes and tens of
thousands of HDFS clients, since each DataNode may execute multiple application tasks
simultaneously. The DataNodes serve read and write requests from the filesystem’s
clients and perform block maintenance and replication as directed by the NameNode.
Block management in HDFS differs from that of a normal filesystem: the size of the data
file that stores a block equals the actual length of the block. A half-full block
therefore needs only half the space of a full block on the local drive, which keeps
storage compact, and no extra space is consumed for the block, unlike in a regular
filesystem.
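
The space accounting described above can be illustrated with a little arithmetic. The
following is a minimal sketch, not HDFS code: the function name `block_file_sizes` is
hypothetical, and the 128 MiB block size is an assumption chosen for illustration (the
text does not state a block size). Sizes are per replica; replication is not modeled.

```python
from typing import List

MIB = 1024 * 1024

def block_file_sizes(file_size: int, block_size: int = 128 * MIB) -> List[int]:
    """Return the size in bytes of each block file used to store one file.

    Every block except possibly the last is full; the last block file is only
    as long as the data it actually holds.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # partial last block takes only what it needs
    return sizes

# A 192 MiB file occupies one full 128 MiB block file and one 64 MiB block file,
# so the space used on disk equals the file length, not two full blocks.
sizes = block_file_sizes(192 * MIB)
used_on_disk = sum(sizes)
```

Under these assumptions the total on-disk size always equals the file length, which is
the compactness property the paragraph describes.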
A filesystem-based architecture needs to manage consistency, recoverability, and
concurrency for reliable operations. HDFS manages these requirements by creating
image, journal, and checkpoint files.

Image

An image represents the metadata of the namespace (inodes and lists of blocks). On
startup, the NameNode pins the entire namespace image in memory. Holding the image
in memory enables the NameNode to service multiple client requests concurrently.
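
To make the idea of a namespace image concrete, here is a minimal sketch of an
in-memory structure that maps paths to inodes and their block lists. It is only an
illustration under simplifying assumptions: the names `Inode` and `NamespaceImage` are
hypothetical, the namespace is a flat dictionary, and no locking or on-disk format is
modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Inode:
    """Metadata for one file or directory in the namespace."""
    path: str
    is_dir: bool = False
    blocks: List[int] = field(default_factory=list)  # block IDs, in file order

class NamespaceImage:
    """Whole namespace held in memory, so lookups never touch the disk."""

    def __init__(self) -> None:
        self._inodes: Dict[str, Inode] = {"/": Inode("/", is_dir=True)}

    def add_file(self, path: str, blocks: List[int]) -> None:
        self._inodes[path] = Inode(path, blocks=list(blocks))

    def exists(self, path: str) -> bool:
        return path in self._inodes

    def blocks_of(self, path: str) -> List[int]:
        # Return a copy so callers cannot mutate the image by accident.
        return list(self._inodes[path].blocks)

image = NamespaceImage()
image.add_file("/logs/app.log", [101, 102])
```

Because every lookup is a dictionary access in memory, many read-only client requests
can be answered without waiting on local disk I/O.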

Journal

The journal is the modification log of the image, kept in the local host’s native
filesystem. During normal operations, each client transaction is recorded in the
journal, and the journal file is flushed and synced before the acknowledgment is
sent to the client. On startup or during recovery, the NameNode can replay this
journal.
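
The record, flush, sync, then acknowledge sequence and the replay step can be sketched
as a small write-ahead log. This is a simplified illustration, not the NameNode's
actual implementation: the `Journal` class, the `replay` function, the JSON-lines
record format, and the three operation names are all assumptions made for the example.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional

class Journal:
    """Append-only modification log stored in the local filesystem."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._fh = open(path, "a", encoding="utf-8")

    def record(self, op: dict) -> bool:
        """Persist one transaction, then acknowledge it."""
        self._fh.write(json.dumps(op) + "\n")
        self._fh.flush()             # push Python's buffer to the OS
        os.fsync(self._fh.fileno())  # ask the OS to write it to the device
        return True                  # only now is the client acknowledged

    def close(self) -> None:
        self._fh.close()

def replay(path: str, image: Optional[Dict[str, List[int]]] = None) -> Dict[str, List[int]]:
    """Rebuild the namespace by applying journal entries to a starting image."""
    namespace = dict(image or {})
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            op = json.loads(line)
            if op["op"] == "create":
                namespace[op["path"]] = op.get("blocks", [])
            elif op["op"] == "rename":
                namespace[op["dst"]] = namespace.pop(op["src"])
            elif op["op"] == "delete":
                namespace.pop(op["path"], None)
            else:
                raise ValueError(f"unknown journal operation: {op['op']}")
    return namespace
```

Syncing before acknowledging means that any transaction a client has seen confirmed is
already on disk, so a replay after a crash reproduces every acknowledged change.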

Checkpoint

To enable recovery, a persistent record of the image is also stored in the local host’s
native filesystem; this record is called a checkpoint. Once the system starts up, the
NameNode never modifies or updates the checkpoint file. A new checkpoint file can be created
