Page 49 - Building Big Data Applications

P. 49

Chapter 2 Infrastructure and technology 43

In Hadoop deployment, Zookeeper serves as the coordinator for managing all the key
activities:
Manage conﬁguration across nodesdZooKeeper helps you quickly push conﬁgu-
ration changes across dozens or hundreds of nodes
Implement reliable messagingdA guaranteed messaging architecture to deliver
messages can be implemented with Zookeeper.
Implement redundant servicesdManaging a large number of nodes with a Zab
approach will provide a scalable redundancy management solution.
Synchronize process executiondWith ZooKeeper, multiple nodes can coordinate
the start and end of a process or calculation. This approach can ensure consistency
of completion of operations.
Pig

Analyzing large data sets introduces data ﬂow complexities that become harder to
implement in a MapReduce program as data volumes and processing complexities in-
crease. A high-level language that is more user friendly and is SQL-like in terms of
expressing data ﬂows and has the ﬂexibility to manage multi-step data transformations,
handle joins with simplicity, and easy program ﬂow was needed as an abstraction layer
over MapReduce.
Apache Pig is a platform that has been designed and developed for analyzing large
data sets. Pig consists of a high-level language for expressing data analysis programs and
comes with infrastructure for evaluating these programs. At the time of writing this book
(2012), Pig’s current infrastructure consists of a compiler that produces sequences of
MapReduce programs. Pig’s language architecture is a textual language platform called
Pig Latin, whose design goals were based on the requirement to handle large data
processing with minimal complexity and include the following:

Programming Flexibilitydability to break down complex tasks comprised of mul-
tiple steps and interprocess-related data transformations should be encoded as
data ﬂow sequences that are easy to design, develop, and maintain.
Automatic OptimizationdTasks are encoded to let the system optimize their
execution automatically. This allows the user with greater focus on program devel-
opment allowing the user to focus on semantics rather than efﬁciency.
ExtensibilitydUsers can develop UDFs for more complex processing requirements

Programming with Pig Latin

Pig is primarily a scripting language for exploring large datasets. It is developed to
process multiple terabytes of data in half-dozen lines of Pig Latin code. Pig provides
several commands to the developer for introspectingthe data structures in the program,
as it is written.

44 45 46 47 48 49 50 51 52 53 54