Page 24 - Building Big Data Applications


             data processing requirements. The rest of this chapter explains how these platforms
             manage data processing. It is not a tutorial for step-by-step configuration and usage of
             these technologies; references for further reading are provided at the end of the
             chapter.

             Distributed data processing

             Before we examine how big data technologies work and look at the associated
             reference architectures, let us recap distributed data processing.
                Distributed data processing has been in existence since the late 1970s. The primary
             concept was to replicate the DBMS in a master-slave configuration and process data
             across multiple instances. During query processing, each slave would engage in a
             two-phase commit with its master. Several papers on the subject, and on the design of
             its early implementations, have been authored by Dr. Stonebraker, Teradata,
             departments at UC Berkeley, and others.
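The two-phase commit between a master and its slaves can be illustrated with a minimal sketch. This is not taken from any particular DBMS; the `Replica` class, its `healthy` flag, and the `two_phase_commit` function are hypothetical names introduced only for illustration, and the sketch assumes in-memory state with no networking, logging, or timeouts.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical slave instance holding a copy of the data."""
    name: str
    healthy: bool = True                       # assumption: stands in for any reason to vote "no"
    committed: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def prepare(self, key, value):
        # Phase 1: stage the write and vote; an unhealthy replica votes "no".
        if not self.healthy:
            return False
        self.staged[key] = value
        return True

    def commit(self, key):
        # Phase 2 (success path): promote the staged write to committed state.
        self.committed[key] = self.staged.pop(key)

    def rollback(self, key):
        # Phase 2 (failure path): discard whatever was staged for this key.
        self.staged.pop(key, None)


def two_phase_commit(replicas, key, value):
    """Master-side coordination: the write lands on every replica or on none."""
    votes = [r.prepare(key, value) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit(key)
        return True
    for r in replicas:
        r.rollback(key)
    return False
```

Even in this toy form, every write costs two rounds of communication with every replica, and a single "no" vote aborts the write everywhere, which hints at the latency and concurrency problems discussed next.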
                Several commercial and early open source DBMSs have addressed large-scale
             data processing with distributed data management algorithms; however, they all faced
             problems in the areas of concurrency, fault tolerance, supporting multiple copies of data,
             and distributed processing of programs. A bigger barrier was the cost of infrastructure
             (Fig. 2.1).
                Why did distributed data processing fail in the relational architecture? The answer
             to this question lies in multiple dimensions:
               Dependency on RDBMS
                  ACID compliance for transaction management
                  Complex architectures for consistency management
                  Latencies across the system
                    - Slow networks
                    - RDBMS I/O
                   - SAN architecture
               Infrastructure cost
               Complex processing structure

















                    FIGURE 2.1 Distributed data processing in the relational database management system (RDBMS).