History
No book on this subject is complete without the history of Hadoop. The project started out as a
subproject of the open source search engine Nutch, which was started by Mike
Cafarella and Doug Cutting. In 2002 the two developers and architects realized that while
they had built a successful crawler, it could not scale up or scale out. Shortly afterward,
Google published its paper describing the Google File System (GFS) in 2003, which was quickly
followed by its paper on MapReduce in 2004.
In 2004 the Nutch team developed the Nutch Distributed File System (NDFS), an open source
implementation of the GFS design. The NDFS architecture solved the
storage and associated scalability issues. In 2005 the Nutch team completed the port of
the Nutch algorithms to MapReduce. The new architecture enabled processing of large
volumes of unstructured data with far greater scalability.
In 2006 Cafarella and Cutting created a subproject under Apache
Lucene and called it Hadoop (named after Doug Cutting's son's toy elephant). Yahoo
adopted the project, and in January 2008 Hadoop became a top-level open source project
at Apache.
The first generation of Hadoop consisted of the HDFS distributed filesystem (modeled
after NDFS) and the MapReduce framework, along with a coordinator interface and an
interface for reading from and writing to HDFS (sketched below). When the first generation
of the Hadoop architecture was conceived and implemented in 2004 by Cutting and Cafarella,
they were able to automate many of the crawling and indexing operations of search,
improving both efficiency and scalability. Within a few months the architecture scaled to
20 nodes running Nutch without missing a heartbeat. This success prompted Yahoo to hire
Cutting and adopt Hadoop as one of its core platforms. Yahoo kept the platform moving
forward with constant innovation and research, and soon many committers and volunteer
developers and testers were contributing to the growth of a healthy ecosystem around
Hadoop.
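
The read-and-write interface mentioned above lives on in today's Hadoop as the Java
FileSystem API. What follows is a minimal sketch, not the original first-generation code:
the cluster address hdfs://namenode:8020 and the file path are hypothetical, and in
practice fs.defaultFS is normally picked up from core-site.xml rather than set in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster address; normally fs.defaultFS comes from core-site.xml.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file to HDFS (the path is illustrative).
            Path path = new Path("/tmp/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Read the same file back and print its contents.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }
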
At the time of writing (2018), two leading distributors of Hadoop with
management tools and professional services have emerged: Cloudera and Hortonworks. We
have also seen the emergence of Hadoop-based solutions from MapR, IBM, Teradata,
Oracle, and Microsoft. Vertica, SAP, and others are also announcing their own solutions
through multiple partnerships with other providers and distributors.
The current list on the Apache Hadoop website shows the top-level stable projects
and releases, as well as incubating projects that are still evolving (Fig. 2.5).
Hadoop core components
At the heart of the Hadoop framework or architecture are components that can be
called its foundational core. These components are shown in Fig. 2.6.
Let us take a quick look at these components and further understand the ecosystem
evolution and recent changes.