Page 52 - Building Big Data Applications



                In this example the FDA data is highly semistructured, and the compliance logs are
             generated by multiple applications. Processing large data sets with a few simple lines of
             code is what Pig brings to MapReduce and Hadoop data processing.
                Pig is best suited to data collection and preprocessing environments and to
             streaming data processing environments. It is also very useful in data discovery exercises.

             HBASE

             HBASE is an open source, nonrelational database modeled on Google’s Bigtable
             architecture and written entirely in Java. It runs on Hadoop and HDFS,
             providing real-time read/write access to large data sets on the Hadoop platform. HBASE is
             not a database in the purist sense of the term. It scales horizontally and delivers high
             performance for RDBMS-like capabilities, without being fully ACID compliant. HBASE is
             classified as a NoSQL database because, like the Google Bigtable design it is modeled on,
             it is nonrelational.

             HBASE architecture

             Data is organized in HBASE as tables, rows, and columns, much as in a relational database,
             but that is where the similarity ends. Let us look at the data model of HBASE and then
             understand the implementation architecture.
                Data Model: The data model of HBASE consists of Tables, Column Groups, and Rows.

               Tables
                  • Tables are made of rows and columns.
                  • Table cells are the intersection of row and column coordinates. Each cell is
                    versioned by default with a timestamp. The contents of a cell are treated as
                    an uninterpreted array of bytes.
                  • A table row has a sortable rowkey and an arbitrary number of columns.
               Rows
                  • Table rowkeys are also byte arrays, so anything can serve as the rowkey,
                    as opposed to the strongly typed keys of a traditional database.
                  • Table rows are sorted in byte order by rowkey, which is the table’s primary key.
                    All table accesses are via this primary key.
                  • Columns are grouped into families, and a row can have as many columns as
                    are loaded into it.
               Columns and Column Groups (families)
                  • In HBASE, the columns of a row are grouped into column families.
                  • All members of a column family must share a common prefix. For
                    example, the columns person:name and person:comments are both members of
                    the person column family, whereas email:identifier belongs to the email family.
                  • A table’s column families must be specified up front as part of the table schema
                    definition.
                  • New column family members can be added on demand.
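
One way to picture the data model described above is as a sorted map of maps: rowkey, then column (family:qualifier), then timestamp, then an uninterpreted byte value. The following is a minimal in-memory sketch in Python under that assumption. It is not the HBASE client API; the class ToyTable, its methods, and the sample rowkeys and values are hypothetical names introduced only for illustration.

```python
class ToyTable:
    """Toy, in-memory picture of the HBASE data model (not the real client API)."""

    def __init__(self, name, families):
        self.name = name
        # Column families are fixed up front, as part of the table schema.
        self.families = frozenset(families)
        # rowkey (bytes) -> column (bytes, "family:qualifier") -> timestamp -> value (bytes)
        self._rows = {}

    def put(self, rowkey, column, value, ts):
        # The prefix before ":" names the column family and must already exist;
        # the qualifier after ":" can be anything, so new members appear on demand.
        family, sep, _qualifier = column.partition(b":")
        if not sep or family not in self.families:
            raise KeyError(f"unknown column family in {column!r}")
        # Each cell keeps one value per timestamp (versioning).
        self._rows.setdefault(rowkey, {}).setdefault(column, {})[ts] = value

    def get(self, rowkey, column, ts=None):
        # All access goes through the rowkey; the latest version is the default.
        versions = self._rows[rowkey][column]
        return versions[max(versions) if ts is None else ts]

    def scan(self):
        # Rows come back in byte order of their rowkeys.
        for rowkey in sorted(self._rows):
            yield rowkey, self._rows[rowkey]


# Hypothetical sample data, reusing the person and email families from the text.
table = ToyTable("people", families=[b"person", b"email"])
table.put(b"row2", b"person:name", b"Ada", ts=1)
table.put(b"row10", b"person:name", b"Grace", ts=1)
table.put(b"row10", b"person:comments", b"added on demand", ts=2)
table.put(b"row10", b"person:name", b"Grace H.", ts=3)

scanned_keys = [rowkey for rowkey, _ in table.scan()]
latest_name = table.get(b"row10", b"person:name")
first_name = table.get(b"row10", b"person:name", ts=1)
```

In this sketch, scanning returns b"row10" before b"row2" because rowkeys are compared byte by byte rather than numerically, which is why rowkey design (for example, zero-padding numeric keys) matters when all access goes through a byte-ordered key. Writing to a column such as b"phone:home" raises an error because the phone family was never declared in the schema, while a new qualifier under person is accepted without any schema change.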