Page 65 - Building Big Data Applications

Chapter 2   Infrastructure and technology  59



                                               FIGURE 2.21 CAP theorem (vertices labeled C, A, and P).

                   In simple terms, the CAP theorem states that a distributed data system can
                 guarantee at most two of three requirements: consistency (every node or system sees
                 the same data at the same time), availability (every request gets a response), and
                 partition tolerance (the system keeps operating despite a network partition, that is,
                 lost or delayed communication between nodes). A system that relaxes consistency under
                 this model is described as having a BASE (basically available, soft state, eventually
                 consistent) architecture, as opposed to ACID.
                   Combining the principles of the CAP theorem with the data architecture of Bigtable
                 or Dynamo, several solutions have evolved: HBase, MongoDB, Riak, Voldemort, Neo4J,
                 Cassandra, Hypertable, HyperGraphDB, Memcached, Tokyo Cabinet, Redis, CouchDB, and
                 more niche solutions. Of these, the most popular and widely deployed are the
                 following:

                   HBase, Hypertable, Bigtable: architected on CP (from CAP)
                   Cassandra, Dynamo, Voldemort: architected on AP (from CAP)
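
The difference between a CP and an AP design can be illustrated with a toy model. The sketch below is not how any of the systems named above is implemented; `Replica`, `TinyCluster`, and `demo` are hypothetical names, the cluster has only two nodes, and only writes are restricted during a partition. It shows the trade-off the CAP theorem describes: when the nodes cannot communicate, a CP-style system refuses the write to stay consistent, while an AP-style system accepts it and reconciles later.

```python
class Replica:
    """One node holding a single versioned value."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0


class TinyCluster:
    """Two replicas; mode is 'CP' or 'AP' (labels used only in this sketch)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, value):
        """Write through replica a; return True if the write was accepted."""
        if self.partitioned and self.mode == "CP":
            # CP: give up availability rather than let the replicas diverge.
            return False
        self.a.version += 1
        self.a.value = value
        if not self.partitioned:
            self._copy(self.a, self.b)
        # AP during a partition: replica b is now stale until heal() runs.
        return True

    def read(self, replica_name):
        replica = self.a if replica_name == "a" else self.b
        return replica.value

    def heal(self):
        """End the partition; the replica with the newer version wins."""
        self.partitioned = False
        if self.a.version >= self.b.version:
            self._copy(self.a, self.b)
        else:
            self._copy(self.b, self.a)

    @staticmethod
    def _copy(src, dst):
        dst.value, dst.version = src.value, src.version


def demo(mode):
    """Write, partition, write again, then heal; report what each step saw."""
    cluster = TinyCluster(mode)
    cluster.write("v1")
    cluster.partitioned = True
    accepted = cluster.write("v2")
    during = (cluster.read("a"), cluster.read("b"))
    cluster.heal()
    after = (cluster.read("a"), cluster.read("b"))
    return accepted, during, after
```

Calling `demo("CP")` reports a rejected write with both replicas still agreeing on the old value, while `demo("AP")` reports an accepted write, a temporary disagreement between the replicas, and agreement again after healing, which is the "eventually consistent" part of BASE.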

                   Broadly, NoSQL databases are classified into four subcategories.
                   Key-value pair: This model is implemented using a hash table in which a unique
                 key points to a particular item of data, forming a key-value pair. Examples:
                 Voldemort and Riak.
                   Column family stores: An extension of the key-value architecture with columns
                 and column families; the overall goal was to process distributed data over a pool of
                 infrastructure. Examples: HBase and Cassandra.
                   Document databases: This class of databases is modeled after Lotus Notes and is
                 similar to key-value stores. The data is stored as a document and is represented in
                 JSON or XML format. The biggest design feature is the flexibility to nest multiple
                 levels of key-value pairs. Example: CouchDB.
                   Graph databases: Based on graph theory, this class of database supports
                 scalability across a cluster of machines. The representation of extremely
                 complex sets of documents is still evolving. Example: Neo4J.
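
The four data models can be compared side by side with plain in-memory structures. This is only a sketch of the shapes of the data, not of any product's storage engine; the keys, names, and values are made-up sample data, and `neighbors` is a hypothetical helper.

```python
import json

# Key-value pair: a hash table mapping a unique key to one opaque item of data.
key_value = {"user:42": b"opaque serialized profile"}

# Column family: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2019-01-01"},
    }
}

# Document: a self-describing record with nested levels of key-value pairs,
# represented here as JSON.
document = json.loads(
    '{"_id": "42", "name": "Ada",'
    ' "address": {"city": "London", "geo": {"lat": 51.5}}}'
)

# Graph: nodes plus labeled edges, stored here as an adjacency list.
graph = {
    "Ada": [("KNOWS", "Bob")],
    "Bob": [("WORKS_AT", "Acme")],
    "Acme": [],
}


def neighbors(g, node, label):
    """Return the nodes reachable from `node` over edges with this label."""
    return [dst for (lbl, dst) in g.get(node, []) if lbl == label]
```

The key-value store can only fetch the whole item by its key, the column family and document models allow addressing parts of a record (for example `column_family["user:42"]["profile"]["city"]` or `document["address"]["geo"]["lat"]`), and the graph model answers relationship questions such as `neighbors(graph, "Ada", "KNOWS")`.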
                   Let us focus on the different classes of NoSQL databases and understand their
                 technology approaches. We have already discussed HBase in the Hadoop sections of
                 this chapter.

                 Key-value pair: Voldemort
                 Voldemort is a project that originated at LinkedIn. The underlying need at LinkedIn was
                 a highly scalable, lightweight database that can work without the rigidity of ACID