Page 45 - Building Big Data Applications
Chapter 2 Infrastructure and technology 39
FIGURE 2.11 Conceptual SQL/MapReduce architecture.
- Files, once processed, cannot be reprocessed from a mid-point. If a new version of the data is sent as a file, the entire file has to be processed again.
- MapReduce on large clusters can be difficult to manage.
- The entire platform is, by design, oriented toward handling extremely large files and hence is not suited for transaction processing.
- When files are split for processing, the completion of processing across all the nodes in a cluster follows a soft-state model of eventual consistency rather than a strict transactional guarantee.
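To make the MapReduce model these limitations refer to concrete, the following is a minimal sketch in plain Python that simulates the map, shuffle, and reduce phases of a word count in a single process. The function names and the in-memory shuffle are illustrative only; a real Hadoop job would run mappers and reducers on separate nodes against file splits in HDFS.

```python
from collections import defaultdict

def map_phase(document):
    # A mapper emits (key, value) pairs for its input split;
    # here, (word, 1) for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework's shuffle/sort step groups all values by key
    # so that each reducer sees one key with all of its values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # A reducer aggregates the values for a single key.
    return (key, sum(values))

docs = ["big data needs big clusters", "data moves to clusters"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Note that the shuffle holds everything in memory here; on a cluster, that grouping is exactly the distributed step whose partial failures lead to the eventual-consistency behavior described above.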
ZooKeeper
Developing large-scale applications on Hadoop, or on any distributed platform, mandates that a resource and application coordinator be available to coordinate tasks between
nodes. In a controlled environment such as an RDBMS or SOA programming, tasks are
generated in a controlled manner, and coordination simply needs to ensure successful
network management without data loss, along with health checks on the nodes of the
distributed system. In the case of Hadoop, data volumes start at multiple terabytes, and
the data is distributed across files on many nodes. Keeping track of user queries and
their associated tasks therefore mandates a coordinator that is as flexible and scalable
as the platform itself.
ZooKeeper is an open source, in-memory, distributed service that provides
coordination for managing distributed applications. It consists of a simple
set of primitives that can be used to build services for synchronization, configuration
maintenance, group membership, and naming. ZooKeeper has a filesystem-like structure that mirrors