FIGURE 2.5 Apache top-level Hadoop projects.
HDFS
The biggest problem faced by the early developers of large-scale data processing
was how to break files down across multiple systems and process each piece of a
file independently of the others, yet still consolidate the results into a
single result set. The second problem, which remained unsolved, was fault
tolerance, both at the file-processing level and at the overall system level in
distributed processing systems.
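To make the split/process/consolidate pattern concrete, the following is a minimal, single-machine sketch in Java; it is not Hadoop code. It splits input lines into chunks, processes each chunk independently on a worker thread, and merges the partial word counts into one result set. The chunk size, thread count, and word-count workload are illustrative assumptions.

import java.util.*;
import java.util.concurrent.*;

public class SplitProcessConsolidate {

    // Process one chunk independently of the others: count word occurrences.
    static Map<String, Integer> processChunk(List<String> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : chunk) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input standing in for a large file.
        List<String> lines = Arrays.asList(
                "the quick brown fox",
                "the lazy dog",
                "the fox jumps over the dog");

        int chunkSize = 2; // assumed chunk size; HDFS uses large fixed-size blocks
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();

        // Split: hand each chunk to a worker, as HDFS hands blocks to nodes.
        for (int i = 0; i < lines.size(); i += chunkSize) {
            List<String> chunk = lines.subList(i, Math.min(i + chunkSize, lines.size()));
            futures.add(pool.submit(() -> processChunk(chunk)));
        }

        // Consolidate: merge the per-chunk results into a single result set.
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> f : futures) {
            f.get().forEach((w, c) -> total.merge(w, c, Integer::sum));
        }
        pool.shutdown();
        System.out.println(total);
    }
}

HDFS and MapReduce apply this same pattern at cluster scale, distributing blocks across machines and layering fault tolerance on top, which is the second problem noted above.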
With GFS, the problem of scaling out processing across multiple systems was largely
solved. HDFS, which is derived from NDFS (the Nutch Distributed File System), was
designed to solve the large-scale distributed data processing problem. Some of the
fundamental design principles of HDFS are the following:
FIGURE 2.6 Core Hadoop components (circa 2017).