Page 52 - Building Big Data Applications
In this example, the FDA data is highly semistructured, and the compliance logs are
generated by multiple applications. Processing large data sets with a few simple lines of
code is what Pig brings to MapReduce and Hadoop data processing.
Pig is used mostly in data collection and preprocessing environments and in
streaming data processing environments. It is also very useful in data discovery exercises.
HBASE
HBASE is an open source, nonrelational database modeled on Google’s Bigtable
architecture and developed entirely in Java. It runs on Hadoop and HDFS,
providing real-time read/write access to large data sets on the Hadoop platform. HBASE is
not a database in the purist definition of the term. It provides very high scalability and
performance for RDBMS-like capabilities without being ACID compliant. HBASE is
classified as a NoSQL database because it is modeled after Google Bigtable.
HBASE architecture
Data in HBASE is organized into tables, rows, and columns, much as in a relational
database, but this is where the similarity ends. Let us look at the HBASE data model and
then at the implementation architecture.
Data Model: The HBASE data model consists of tables, column groups, and rows.
Tables
Tables are made of rows and columns.
Table cells are the intersection of row and column coordinates. Each cell is
versioned by default with a timestamp. The contents of a cell are treated as an
uninterpreted array of bytes.
A table row has a sortable rowkey and an arbitrary number of columns.
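The cell behavior described above (byte-array values, one version per timestamp) can be illustrated with a minimal in-memory sketch. This is a toy model, not the HBASE client API; the class name VersionedCells and its put/get methods are hypothetical names chosen for illustration, and timestamps are plain integers supplied by the caller.

```python
from collections import defaultdict
from typing import Optional


class VersionedCells:
    """Toy model of versioned cells (hypothetical; not the HBASE API)."""

    def __init__(self) -> None:
        # (rowkey, column) -> {timestamp: value}; values are raw bytes
        self._cells = defaultdict(dict)

    def put(self, rowkey: bytes, column: bytes, value: bytes, timestamp: int) -> None:
        # Writing with a new timestamp adds a version instead of overwriting.
        self._cells[(rowkey, column)][timestamp] = value

    def get(self, rowkey: bytes, column: bytes,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        # Return the newest version at or before `timestamp`
        # (or the newest overall when no timestamp is given).
        versions = self._cells.get((rowkey, column), {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


cells = VersionedCells()
cells.put(b"row1", b"person:name", b"Ada", timestamp=1)
cells.put(b"row1", b"person:name", b"Ada L.", timestamp=2)

latest = cells.get(b"row1", b"person:name")
older = cells.get(b"row1", b"person:name", timestamp=1)
missing = cells.get(b"row2", b"person:name")
```

The store never interprets the values; any encoding or typing is left to the application that wrote the bytes.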
Rows
Table rowkeys are also byte arrays. Anything can therefore serve as the rowkey,
as opposed to the strongly typed datatypes of a traditional database.
Table rows are sorted in byte order by rowkey, which is the table’s primary key. All
table accesses are via the table primary key.
Columns are grouped into families, and a row can have as many columns as are
loaded.
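Because rows are sorted byte by byte rather than by any typed value, the encoding of the rowkey decides the row order. The sketch below uses plain Python bytes, which compare in the same lexicographic fashion, to show the effect; the helper names padded_rowkey and scan, and the fixed width of 6 digits, are assumptions made for illustration only.

```python
import bisect


def padded_rowkey(prefix: str, n: int, width: int = 6) -> bytes:
    # Zero-pad the numeric part so byte order matches numeric order.
    return f"{prefix}{n:0{width}d}".encode("utf-8")


def scan(sorted_keys, start: bytes, stop: bytes):
    # Range scan over sorted rowkeys: start inclusive, stop exclusive.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Unpadded keys: b"row10" sorts before b"row2" in byte order.
naive = sorted(f"row{n}".encode("utf-8") for n in (1, 2, 10))

# Padded keys: byte order now agrees with numeric order.
padded = sorted(padded_rowkey("row", n) for n in (1, 2, 10))

# With sorted keys, a contiguous key range is a cheap slice.
in_range = scan(padded, b"row000002", b"row000011")
```

Here naive comes out as row1, row10, row2, while padded comes out as row000001, row000002, row000010. Since all access goes through the rowkey, choosing an encoding that sorts the way the application reads is a central design decision.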
Columns and Column Groups (families)
In HBASE, row columns are grouped into column families.
All members of a column family must share a common prefix. For
example, the columns person:name and person:comments are both members of
the person column family, whereas email:identifier belongs to the email family.
A table’s column families must be specified up front as part of the table schema
definition.
New column family members can be added on demand.
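The family rules above can be sketched as a small in-memory table: families are fixed when the table is created, while new columns inside an existing family are accepted at write time. ToyTable and its put method are hypothetical names for illustration, not the HBASE client API, and the contacts table and its values are made-up sample data.

```python
from typing import Dict, Iterable


class ToyTable:
    """Toy table with a fixed set of column families (not the HBASE API)."""

    def __init__(self, name: str, families: Iterable[str]) -> None:
        self.name = name
        # Families are declared up front, as part of the schema.
        self.families = frozenset(families)
        # rowkey -> {"family:qualifier": value}
        self.rows: Dict[bytes, Dict[str, bytes]] = {}

    def put(self, rowkey: bytes, column: str, value: bytes) -> None:
        # The family is the prefix before the colon.
        family, sep, _qualifier = column.partition(":")
        if not sep or family not in self.families:
            raise KeyError(f"unknown column family in {column!r}")
        # Any qualifier is accepted once the family exists.
        self.rows.setdefault(rowkey, {})[column] = value


table = ToyTable("contacts", ["person", "email"])
table.put(b"row1", "person:name", b"Ada")
table.put(b"row1", "person:comments", b"first entry")
table.put(b"row1", "person:nickname", b"AL")  # new member, existing family

try:
    table.put(b"row1", "address:city", b"London")  # family never declared
    rejected = False
except KeyError:
    rejected = True
```

In this sketch the write to address:city is rejected because the address family was not part of the schema, whereas person:nickname is stored without any schema change.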