Page 49 - Building Big Data Applications

Chapter 2   Infrastructure and technology  43


                   In a Hadoop deployment, ZooKeeper serves as the coordinator for the following key
                 activities:
                   • Manage configuration across nodes: ZooKeeper helps you quickly push
                     configuration changes across dozens or hundreds of nodes.
                   • Implement reliable messaging: A guaranteed messaging architecture to deliver
                     messages can be implemented with ZooKeeper.
                   • Implement redundant services: Managing a large number of nodes with a Zab
                     (ZooKeeper Atomic Broadcast) approach provides a scalable redundancy
                     management solution.
                   • Synchronize process execution: With ZooKeeper, multiple nodes can coordinate
                     the start and end of a process or calculation, which helps ensure that
                     operations complete consistently.
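
                 The idea of coordinating the start and end of a process can be sketched in a few
                 lines. The example below is only a single-machine analogy: it uses Python threads
                 and threading.Barrier in place of ZooKeeper nodes, and it does not call any
                 ZooKeeper API. The names run_workers and worker are hypothetical.

```python
import threading


def run_workers(n_workers):
    """Run n_workers threads that enter and leave a work phase together.

    Local analogy only: threads stand in for cluster nodes, and
    threading.Barrier stands in for a coordination service.
    """
    start_barrier = threading.Barrier(n_workers)
    end_barrier = threading.Barrier(n_workers)
    events = []                 # shared log of (phase, worker_id)
    lock = threading.Lock()     # protects the shared log

    def worker(worker_id):
        start_barrier.wait()    # block until every worker is ready to start
        with lock:
            events.append(("work", worker_id))
        end_barrier.wait()      # block until every worker has done its work
        with lock:
            events.append(("done", worker_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_workers(4)
```

                 Because no worker passes the second barrier until all of them have logged their
                 work, every "work" entry appears in the log before any "done" entry.
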
                 Pig

                 Analyzing large data sets introduces data flow complexities that become harder to
                 implement in a MapReduce program as data volumes and processing complexity
                 increase. What was needed as an abstraction layer over MapReduce was a high-level,
                 user-friendly language that is SQL-like in expressing data flows, flexible enough
                 to manage multi-step data transformations, and able to handle joins and program
                 flow with simplicity.
                   Apache Pig is a platform designed and developed for analyzing large data sets.
                 Pig consists of a high-level language for expressing data analysis programs and
                 comes with infrastructure for evaluating these programs. At the time of writing this
                 book (2012), Pig’s infrastructure consists of a compiler that produces sequences of
                 MapReduce programs. Pig’s language layer is a textual language called Pig Latin,
                 whose design goals were based on the requirement to handle large data processing
                 with minimal complexity and include the following:

                   • Programming flexibility: Complex tasks comprised of multiple steps and
                     interrelated data transformations can be broken down and encoded as data
                     flow sequences that are easy to design, develop, and maintain.
                   • Automatic optimization: Tasks are encoded in a way that lets the system
                     optimize their execution automatically, allowing the user to focus on
                     semantics rather than efficiency.
                   • Extensibility: Users can develop user-defined functions (UDFs) for more
                     complex processing requirements.
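
                 To make the data flow idea concrete, the following is a minimal sketch in plain
                 Python rather than Pig Latin. Each named step plays roughly the role of a Pig Latin
                 operator (LOAD, FILTER, GROUP, FOREACH ... GENERATE), and to_megabytes stands in
                 for a user-defined function. The sample data, the field layout, and all function
                 names are assumptions made for illustration.

```python
from collections import defaultdict


def load(lines):
    """Parse 'user,bytes' text lines into (user, bytes) tuples."""
    return [(user, int(n_bytes))
            for user, n_bytes in (line.split(",") for line in lines)]


def to_megabytes(n_bytes):
    """Hypothetical user-defined function applied to an aggregated value."""
    return n_bytes / 1_000_000


def group_by_user(records):
    """Collect the byte counts of each user into one list per user."""
    groups = defaultdict(list)
    for user, n_bytes in records:
        groups[user].append(n_bytes)
    return dict(groups)


# Made-up sample input; a real job would read from distributed storage.
raw = ["alice,2000000", "bob,500000", "alice,3000000", "bob,1500000", "carol,100"]

records = load(raw)                                    # role of LOAD
large = [r for r in records if r[1] >= 1000]           # role of FILTER
grouped = group_by_user(large)                         # role of GROUP
totals = {user: to_megabytes(sum(values))              # role of FOREACH ... GENERATE
          for user, values in grouped.items()}
```

                 Each intermediate result has a name, so a step can be inspected or changed without
                 touching the others. This is the property the design goals above describe as
                 programming flexibility.
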


                 Programming with Pig Latin

                 Pig is primarily a scripting language for exploring large data sets. It was developed
                 so that multiple terabytes of data can be processed with half a dozen lines of Pig
                 Latin code. Pig provides several commands that let the developer introspect the
                 data structures in the program as it is written.