Page 28 - Building Big Data Applications
Technologies for big data processing
There are several technologies that have come and gone in the data processing world, from mainframes to two-tier databases to VSAM files. Several programming languages have evolved to solve the puzzle of high-speed data processing and have either stayed niche or never found adoption. After the initial hype and bust of the Internet bubble came a moment in the history of data processing that caused unrest in the industry: the scalability of Internet search. Technology startups such as Google, RankDex (now known as Baidu), and Yahoo, along with open source projects like Nutch, were all figuring out how to scale the performance of the search query almost infinitely. Out of these efforts came the technologies that are now the foundation of big data processing.
MapReduce
MapReduce is a programming model for processing extremely large sets of data. Google originally developed it to solve the scalability of search computation. Its foundations rest on the principles of parallel and distributed processing, without any database dependency. The flexibility of MapReduce lies in its ability to run distributed computations on large amounts of data across clusters of commodity servers, with a simple task-based model for managing them.
The key features of MapReduce that make it the processing interface on Hadoop or Cassandra include the following:
Automatic parallelization
Automatic distribution
Fault tolerance
Status and monitoring tools
Easy abstraction for programmers
Programming language flexibility
Extensibility
MapReduce programming model
MapReduce is based on functional programming models, drawn largely from Lisp. Typically, users implement two functions:
Map (in_key, in_value) -> (out_key, intermediate_value) list
The Map function, written by the user, receives an input pair of keys and values and, after its computation cycles, produces a set of intermediate key/value pairs. Library functions then group together all intermediate values associated with an intermediate key I and pass them to the Reduce function.
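The map/shuffle/reduce flow described above can be sketched in a few lines of plain Python. This is a minimal single-process illustration of the model, not a distributed implementation: the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) and the word-count example are assumptions chosen for clarity, and the in-memory sort stands in for the shuffle step a real framework performs across a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative sketch only: a real MapReduce framework distributes these
# phases across many commodity servers and handles failures automatically.

def map_fn(in_key, in_value):
    """Map: receive an input key/value pair, emit intermediate (word, 1) pairs."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Reduce: combine all intermediate values collected for one key."""
    return (out_key, sum(intermediate_values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key, value) input pair.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    # Shuffle phase: group all values that share the same intermediate key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: reduce_fn receives each key with its grouped values.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

result = run_mapreduce([("doc1", "big data big apps")], map_fn, reduce_fn)
# result -> [("apps", 1), ("big", 2), ("data", 1)]
```

Because the user supplies only the two functions, the framework is free to parallelize the map and reduce phases independently, which is the source of the automatic parallelization and distribution listed above.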