Page 39 - Building Big Data Applications
Chapter 2 Infrastructure and technology
BackupNode
The BackupNode can be considered a read-only NameNode. It contains all of the
filesystem's metadata except block locations. It accepts a stream of namespace
transactions from the active NameNode, saves them to its own storage directories,
and applies them to the namespace image in its memory. If the NameNode fails, the
BackupNode's in-memory image and its checkpoint on disk together record the latest
namespace state and can be used to create a checkpoint for recovery. Creating a
checkpoint from a BackupNode is very efficient because it already holds the entire
image in its own memory and on its own disk.
A BackupNode can perform all operations of the regular NameNode that do not
involve modification of the namespace or management of block locations. This feature
gives administrators the option of running a NameNode without persistent storage,
delegating responsibility for persisting the namespace state to the BackupNode.
This is not a normal practice, but it can be useful in certain situations.
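In Hadoop, a BackupNode runs on its own host, and the active NameNode reaches it through addresses configured in hdfs-site.xml. A minimal configuration fragment might look like the following sketch (the host name is illustrative; the property names and default ports follow the Hadoop 2.x conventions):

```xml
<!-- hdfs-site.xml on the BackupNode host; host name is an example -->
<property>
  <name>dfs.namenode.backup.address</name>
  <value>backupnode.example.com:50100</value>
</property>
<property>
  <name>dfs.namenode.backup.http-address</name>
  <value>backupnode.example.com:50105</value>
</property>
```

With these addresses in place, the BackupNode process registers with the active NameNode and begins receiving the journal stream of namespace transactions.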
Filesystem snapshots
Like any filesystem, HDFS periodically needs upgrades and patches to be applied,
and the possibility of corrupting the system through software bugs or human error
always exists. To guard against corruption or shutdown, we can create snapshots in
HDFS. The snapshot mechanism lets administrators save the current state of the
filesystem so that it can be rolled back in case of failure.
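Snapshots are managed from the command line: an administrator first marks a directory as snapshottable, after which read-only, point-in-time snapshots of it can be created and removed. A typical session against a Hadoop 2.x or later cluster might look like the following sketch (the path, file name, and snapshot name are illustrative):

```shell
# Mark the directory as snapshottable (administrator only)
hdfs dfsadmin -allowSnapshot /user/data

# Create a read-only, point-in-time snapshot named "before-upgrade"
hdfs dfs -createSnapshot /user/data before-upgrade

# Snapshot contents appear under the hidden .snapshot directory
hdfs dfs -ls /user/data/.snapshot/before-upgrade

# Roll back by copying a file out of the snapshot, then drop the snapshot
hdfs dfs -cp /user/data/.snapshot/before-upgrade/part-00000 /user/data/
hdfs dfs -deleteSnapshot /user/data before-upgrade
```

Because snapshots only record metadata and share data blocks with the live filesystem, creating one is fast and does not copy file data.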
Load balancing, disk management, block allocation, and advanced file management
are also handled by the HDFS design. For further details on these areas, refer to the
HDFS architecture guide on the Apache HDFS project page.
Based on this brief architecture discussion of HDFS, we can see how Hadoop scales
out almost without limit and manages redundancy, while exposing the basic data
management functions through a series of API calls.
MapReduce
Earlier in the chapter we discussed the original MapReduce implementation on GFS.
In big data technology deployments, Hadoop is the most popular and widely deployed
platform for the MapReduce framework. There are three key reasons for this:
Extreme parallelism available in Hadoop
Extreme scalability programmable with MapReduce
The HDFS architecture
To run a query or a program written in a procedural language such as Java or C++
on Hadoop, we need to execute it through a MapReduce API component. Let us revisit
the MapReduce components in the Hadoop architecture to understand the overall
design approach needed for such a deployment.
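Before looking at the Hadoop components themselves, the map-shuffle-reduce flow can be illustrated with a minimal, single-process word-count sketch. This is plain Python, not the Hadoop API: the map step emits (word, 1) pairs, the shuffle step groups pairs by key as the framework does between phases, and the reduce step sums each group.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In Hadoop, the same three roles are distributed: map tasks run in parallel on the nodes holding the input blocks, the framework shuffles intermediate pairs across the network, and reduce tasks produce the final output on HDFS.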