Page 28 - Building Big Data Applications


             Technologies for big data processing

             There are several technologies that have come and gone in the data processing world,
             from mainframes, to two-tier databases, to VSAM files. Several programming languages
             have evolved to solve the puzzle of high-speed data processing and have either
             stayed niche or never found adoption. After the initial hype and bust of the Internet
             bubble, there came a moment in the history of data processing that caused unrest in the
             industry: the scalability of Internet search. Technology startups like Google,
             RankDex (now known as Baidu), and Yahoo, along with open source projects like Nutch,
             were all figuring out how to make search query performance scale without limit. Out
             of these efforts came the technologies that are now the foundation of big data
             processing.


             MapReduce

             MapReduce is a programming model for processing extremely large sets of data. Google
             originally developed it to solve the scalability of search computation. Its foundations
             are based on the principles of parallel and distributed processing, without any database
             dependency. The flexibility of MapReduce lies in its ability to run distributed
             computations on large amounts of data across clusters of commodity servers, with simple
             task-based models for managing the work.
                The key features of MapReduce that make it the processing interface on Hadoop or
             Cassandra include the following:
               - Automatic parallelization
               - Automatic distribution
               - Fault tolerance
               - Status and monitoring tools
               - Easy abstraction for programmers
               - Programming language flexibility
               - Extensibility


             MapReduce programming model

             MapReduce is based on functional programming models, largely drawn from Lisp.
             Typically, users implement two functions:

               Map (in_key, in_value) -> (out_key, intermediate_value) list
                  - The Map function, written by the user, receives an input key/value pair
                    and, after its computation cycles, produces a set of intermediate key/value pairs.
                  - Library functions then group together all intermediate values associated
                    with an intermediate key I and pass them to the Reduce function.
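             The flow above can be sketched as a single-process Python word-count example. This is a minimal illustration, not a real framework: the in-memory grouping step stands in for the library's shuffle, and in an actual MapReduce deployment each phase would run distributed across the cluster. The function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are chosen here for clarity, not taken from any specific library.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Map: for each word in the input line, emit an intermediate (word, 1) pair."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Reduce: sum all counts emitted for a single word."""
    return sum(intermediate_values)

def run_mapreduce(inputs):
    # Map phase: apply map_fn to every (in_key, in_value) pair.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: group intermediate values by key (the library's job).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn to each group of values.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

counts = run_mapreduce([("doc1", "big data big apps"), ("doc2", "big data")])
print(counts)  # {'apps': 1, 'big': 3, 'data': 2}
```

             Note that the programmer only writes the two small functions at the top; the parallelization, distribution, and grouping in the middle are exactly what the framework automates.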