Page 39 - Building Big Data Applications

Chapter 2   Infrastructure and technology


                 BackupNode

                 The BackupNode can be considered a read-only NameNode. It contains all of the
                 filesystem's metadata except for block locations. It accepts a stream of namespace
                 transactions from the active NameNode, saves them to its own storage directories,
                 and applies them to its own in-memory namespace image. If the NameNode fails, the
                 BackupNode's in-memory image and its checkpoint on disk together form a record of
                 the latest namespace state and can be used to create a checkpoint for recovery.
                 Creating a checkpoint from a BackupNode is very efficient because it already holds
                 the entire image in its own memory and on its own disk.
                   A BackupNode can perform all operations of the regular NameNode that do not
                 involve modifying the namespace or managing block locations. This feature gives
                 administrators the option of running a NameNode without persistent storage,
                 delegating responsibility for persisting the namespace state to the BackupNode.
                 This is not normal practice, but it can be useful in certain situations.
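As an illustration, a BackupNode is typically addressed through properties in hdfs-site.xml on the backup host; the hostname and ports below are example values, and the exact property names may vary by Hadoop release:

```xml
<!-- hdfs-site.xml on the backup host (backup-host and ports are example values) -->
<property>
  <name>dfs.namenode.backup.address</name>
  <value>backup-host:50100</value>
</property>
<property>
  <name>dfs.namenode.backup.http-address</name>
  <value>backup-host:50105</value>
</property>
```

The daemon itself is then launched on that host with `hdfs namenode -backup`.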

                 Filesystem snapshots

                 Like any filesystem, HDFS periodically needs upgrades and patches. The possibility
                 of corrupting the system through software bugs or human error always exists. To
                 guard against corruption or data loss, we can create snapshots in HDFS. The
                 snapshot mechanism lets administrators save the current state of the filesystem
                 so that it can be rolled back to in case of failure.
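In current Hadoop releases, snapshots are managed from the command line. The directory path, snapshot name, and file name below are illustrative placeholders, and the commands assume a running HDFS cluster:

```
# Mark a directory as snapshottable (administrator operation)
hdfs dfsadmin -allowSnapshot /user/data

# Capture the directory's current state under a chosen name
hdfs dfs -createSnapshot /user/data before-upgrade

# Snapshots are read-only and reachable under the .snapshot directory;
# restoring a file is a copy back out of the snapshot path
hdfs dfs -cp /user/data/.snapshot/before-upgrade/somefile /user/data/
```

Because a snapshot only records metadata and shares unchanged blocks with the live filesystem, creating one is fast and does not duplicate data.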
                   Load balancing, disk management, block allocation, and advanced file management
                 are topics handled by HDFS design. For further details on these areas, refer to the HDFS
                 architecture guide on Apache HDFS project page.
                   Based on this brief architecture discussion of HDFS, we can see how Hadoop
                 achieves scalability and manages redundancy while keeping the basic data
                 management functions accessible through a series of API calls.

                 MapReduce

                 Earlier in the chapter we discussed the original MapReduce implementation on GFS.
                 In big data technology deployments, Hadoop is the most popular and most widely
                 deployed platform for the MapReduce framework. There are three key reasons for this:
                   The extreme parallelism available in Hadoop
                   The extreme scalability programmable with MapReduce
                   The HDFS architecture
                   To run a query or a program written in a procedural language such as Java or C++
                 on Hadoop, we need to execute it through a MapReduce API component. Let us revisit
                 the MapReduce components in the Hadoop architecture to understand the overall
                 design approach needed for such a deployment.
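Before looking at the Hadoop components themselves, the map, shuffle, and reduce phases can be sketched in plain Python. This is a minimal single-process illustration of the data flow, not Hadoop API code; the function names are my own, and the word-count mapper and reducer are the canonical example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Sort and group intermediate pairs by key, as Hadoop's shuffle/sort does."""
    pairs.sort(key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped, reducer):
    """Apply the reducer once per key to its list of values."""
    return [reducer(key, values) for key, values in grouped]

# Word count: emit (word, 1) per word, then sum counts per word.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return (word, sum(counts))

lines = ["big data", "big apps"]
result = reduce_phase(shuffle_phase(map_phase(lines, wc_mapper)), wc_reducer)
# result -> [("apps", 1), ("big", 2), ("data", 1)]
```

In Hadoop proper, each phase runs distributed across the cluster and the shuffle moves data between nodes, but the contract between mapper and reducer is the same as in this sketch.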