Page 45 - Building Big Data Applications
Chapter 2 Infrastructure and technology 39
FIGURE 2.11 Conceptual SQL/MapReduce architecture.
- Files, once processed, cannot be reprocessed from a mid-point. If a new version of the data is sent as a file, the entire file has to be processed again.
- MapReduce on large clusters can be difficult to manage.
- The entire platform is, by design, oriented toward handling extremely large files and hence is not suited for transaction processing.
- When files are split for processing, the completion of processing across all the nodes in a cluster follows a soft-state model of eventual consistency rather than a strict transactional guarantee.
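To make the MapReduce model these limitations refer to concrete, the following is a minimal sketch in plain Python that simulates the map, shuffle, and reduce phases of a word count in a single process. The function names and the in-memory shuffle are illustrative only; a real Hadoop job would run mappers and reducers on separate nodes against file splits in HDFS.

```python
from collections import defaultdict

def map_phase(document):
    # A mapper emits (key, value) pairs for its input split;
    # here, (word, 1) for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework's shuffle/sort step groups all values by key
    # so that each reducer sees one key with all of its values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # A reducer aggregates the values for a single key.
    return (key, sum(values))

docs = ["big data needs big clusters", "data moves to clusters"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Note that the shuffle holds everything in memory here; on a cluster, that grouping is exactly the distributed step whose partial failures lead to the eventual-consistency behavior described above.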
ZooKeeper
Developing large-scale applications on Hadoop, or on any distributed platform, mandates that a resource and application coordinator be available to coordinate tasks between
nodes. In a controlled environment such as an RDBMS or SOA programming, tasks are
generated in a controlled manner, and coordination simply needs to ensure successful
network management without data loss, along with health checks on the nodes of the
distributed system. In the case of Hadoop, data volumes start at multiple terabytes, and
the data is distributed across files on many nodes. Keeping track of user queries and
their associated tasks therefore mandates a coordinator that is as flexible and scalable
as the platform itself.
ZooKeeper is an open source, in-memory, distributed service that provides
coordination for managing distributed applications. It consists of a simple
set of primitives that can be used to build services for synchronization, configuration
maintenance, group membership, and naming. ZooKeeper has a filesystem-like structure that mirrors