Page 24 - Building Big Data Applications
data processing requirements. The rest of this chapter explains how data processing is managed by these platforms; it is not a step-by-step tutorial on configuring and using these technologies. References for further reading are provided at the end of the chapter.
Distributed data processing
Before we proceed to understand how big data technologies work and see the associated
reference architectures, let us briefly recap distributed data processing.
Distributed data processing has been in existence since the late 1970s. The primary
concept was to replicate the DBMS in a master-slave configuration and process data
across multiple instances. Each slave would engage in a two-phase commit with its
master when processing a query. Several papers on the subject and the designs of its
early implementations have been authored by Dr. Stonebraker, Teradata, the
UC Berkeley departments, and others.
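The master-slave two-phase commit described above can be sketched in a few lines. This is a minimal illustration, not any particular DBMS implementation; the `Coordinator` and `Participant` names and the vote strings are assumptions made for the example.

```python
class Participant:
    """A slave replica that votes on whether it can commit a transaction."""
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False

    def prepare(self, txn):
        # Phase 1: durably log the transaction locally, then vote.
        self.prepared = True
        return "VOTE_COMMIT"

    def commit(self):
        # Phase 2: make the change permanent.
        self.committed = True

    def abort(self):
        # Phase 2 (failure path): discard the prepared work.
        self.prepared = False


class Coordinator:
    """The master that drives both phases across all replicas."""
    def __init__(self, participants):
        self.participants = participants

    def run(self, txn):
        # Phase 1: ask every replica to prepare and collect the votes.
        votes = [p.prepare(txn) for p in self.participants]
        if all(v == "VOTE_COMMIT" for v in votes):
            # Phase 2: unanimous yes, so commit everywhere.
            for p in self.participants:
                p.commit()
            return "COMMITTED"
        # A single "no" vote aborts the whole transaction.
        for p in self.participants:
            p.abort()
        return "ABORTED"


replicas = [Participant("slave-1"), Participant("slave-2")]
print(Coordinator(replicas).run({"update": "row-42"}))  # COMMITTED
```

Note that the coordinator must block until every vote arrives, which is exactly why slow networks and replica failures hurt this design so badly.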
Several commercial and early open source DBMS systems have addressed large-scale
data processing with distributed data management algorithms; however, they all faced
problems in the areas of concurrency, fault tolerance, supporting multiple copies of data,
and distributed processing of programs. A bigger barrier was the cost of infrastructure
(Fig. 2.1).
Why did distributed data processing fail in the relational architecture? The answer
lies in multiple dimensions:
- Dependency on RDBMS
- ACID compliance for transaction management
- Complex architectures for consistency management
- Latencies across the system
  - Slow networks
  - RDBMS I/O
  - SAN architecture
- Infrastructure cost
- Complex processing structure
FIGURE 2.1 Distributed data processing in the relational database management system (RDBMS).
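The latency dimension compounds on every distributed transaction: a two-phase commit pays the network, RDBMS I/O, and SAN costs on both the prepare and the commit round trips. The figures below are illustrative assumptions chosen only to show the arithmetic, not measurements from any real system.

```python
# Hypothetical per-operation latencies (milliseconds) for the layers
# shown in Fig. 2.1; the numbers are illustrative, not measured.
latency_ms = {
    "network round trip (master to slave)": 1.0,
    "RDBMS I/O path": 5.0,
    "SAN storage access": 10.0,
}

# Two-phase commit touches every layer twice: once to prepare,
# once to commit.
round_trips = 2
total = round_trips * sum(latency_ms.values())
print(f"{total:.1f} ms per transaction, per replica")  # 32.0 ms ...
```

Even with these modest assumed figures, the per-transaction cost is dominated by infrastructure the application cannot tune away, which is why the cost and latency barriers in the list above proved so hard to overcome.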