Page 23 - Building Big Data Applications

Infrastructure and technology

This chapter introduces the infrastructure components and the technology vendors
that provide services for them. We discuss the components and their integration in
detail, any known technology limitations, and the specifics of each technology that
users need to identify and align with.

The first rule of any technology used in a business is that automation applied to an
efficient operation will magnify the efficiency. The second is that automation
applied to an inefficient operation will magnify the inefficiency.
Source: Brainy Quote, Bill Gates

Introduction

In the previous chapter we discussed the complexities associated with big data.
Processing this type of data poses a three-dimensional problem, the dimensions being the
volume of the data produced, the variety of formats, and the velocity of data generation.
Handling any of these dimensions in a traditional data processing architecture is not
feasible. The problem itself did not originate in the last decade; architects,
researchers, and organizations have been working on it for years. A simplified approach
to large-scale data processing was to create distributed data processing architectures
and manage the coordination through programming language techniques. This approach solved
the volume requirement but could not handle the other two dimensions. With the advent of
the Internet and search engines, handling complex and diverse data became a necessity
rather than a one-off requirement. It was during this period, in the late 1990s and early
2000s, that a slew of distributed data processing papers and associated algorithms and
techniques were published by Google, Stanford University, Dr. Stonebraker, Eric Brewer,
Doug Cutting (Nutch search engine), and Yahoo, among others.
Today the various architectures and papers contributed by these and other developers
across the world have culminated in several open source projects under the Apache
Software Foundation and the NoSQL movement. All of these technologies, including Hadoop,
Hive, HBase, Cassandra, and MapReduce, have been identified as big data processing
platforms. NoSQL platforms include MongoDB, Neo4j, Riak, Amazon DynamoDB, MemcachedDB,
Berkeley DB, Voldemort, and many more. Though many of these platforms were originally
developed and deployed to meet the data processing needs of web applications and search
engines, they have evolved to support other
