
             History

             No book is complete without the history of Hadoop. The project started out as a
             subproject of the open source search engine Nutch, which was started by Mike
             Cafarella and Doug Cutting. In 2002 the two developers and architects realized that
             while they had built a successful crawler, it could not scale up or scale out. Shortly
             afterward, in 2003, Google published its paper on the Google File System (GFS),
             which was quickly followed by the paper on MapReduce in 2004.
                In 2004 the Nutch team developed the Nutch Distributed File System (NDFS), an
             open source implementation of GFS. The NDFS architecture solved the storage and
             associated scalability issues. In 2005 the Nutch team completed the port of the Nutch
             algorithms to MapReduce. The new architecture enabled processing of large volumes
             of unstructured data with unsurpassed scalability.
                In 2006 Cafarella and Cutting created a subproject under Apache Lucene and
             called it Hadoop (named after Doug Cutting’s son’s toy elephant). Yahoo adopted the
             project, and in January 2008 Hadoop was promoted to a top-level open source project
             at Apache.
                The first generation of Hadoop consisted of the HDFS distributed filesystem
             (modeled after NDFS) and the MapReduce framework, along with a coordinator
             interface and an interface to read from and write to HDFS. When Cutting and
             Cafarella conceived and implemented the first generation of the architecture in 2004,
             they were able to automate many of the crawling and indexing operations in search
             and improve efficiency and scalability. Within a few months they had scaled the
             architecture to 20 nodes running Nutch without missing a heartbeat. This prompted
             Yahoo to hire Cutting and adopt Hadoop as one of its core platforms. Yahoo kept the
             platform moving with constant innovation and research, and soon many committers
             and volunteer developers and testers were contributing to the growth of a healthy
             ecosystem around Hadoop.
                At the time of writing (2018), two leading distributors of Hadoop with management
             tools and professional services have emerged: Cloudera and Hortonworks. We have
             also seen the emergence of Hadoop-based solutions from MapR, IBM, Teradata,
             Oracle, and Microsoft. Vertica, SAP, and others are also announcing their own
             solutions in multiple partnerships with other providers and distributors.
                The most current list at Apache’s website for Hadoop includes the top-level stable
             projects and releases, as well as incubated projects that are still evolving (Fig. 2.5).


             Hadoop core components

             At the heart of the Hadoop framework or architecture are components that can be
             called the foundational core (Fig. 2.6). Let us take a quick look at these components
             and further understand the ecosystem’s evolution and recent changes.