Page 55 - Building Big Data Applications
to become stale. When this happens, the client first queries the META table. If the META
table itself has moved to another location, the client goes back to the ROOT table to
find the new location of META.
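
The lookup described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class names (ToyCluster, ToyClient), the StaleLocation exception, and the server and table names are all hypothetical, and plain dictionaries stand in for the ROOT and META tables. It is not the HBASE client API.

```python
class StaleLocation(Exception):
    """Raised when a location the client holds is no longer valid."""


class ToyCluster:
    """Hypothetical cluster: dicts stand in for the ROOT and META tables."""

    def __init__(self):
        self.root = {"meta": "server-a"}   # ROOT: which server hosts META
        self.meta = {"users": "server-b"}  # META: which server hosts each table

    def read_root(self):
        return self.root["meta"]

    def read_meta(self, meta_server, table):
        # Asking the wrong server for META fails, as if META had moved.
        if meta_server != self.root["meta"]:
            raise StaleLocation("META has moved")
        return self.meta[table]

    def get(self, server, table, key):
        # Asking the wrong server for a table fails, as if the region had moved.
        if server != self.meta[table]:
            raise StaleLocation("region has moved")
        return f"{table}/{key}@{server}"


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_server = cluster.read_root()  # learned once at start-up
        self.cache = {}                         # table -> cached server
        self.lookups = []                       # trace of catalog reads

    def locate(self, table):
        try:
            server = self.cluster.read_meta(self.meta_server, table)
            self.lookups.append("META")
        except StaleLocation:
            # META itself has moved: go back to ROOT, then retry META.
            self.meta_server = self.cluster.read_root()
            self.lookups.append("ROOT")
            server = self.cluster.read_meta(self.meta_server, table)
            self.lookups.append("META")
        self.cache[table] = server
        return server

    def get(self, table, key):
        server = self.cache.get(table) or self.locate(table)
        try:
            return self.cluster.get(server, table, key)
        except StaleLocation:
            # The cached location is stale: refresh it and retry once.
            server = self.locate(table)
            return self.cluster.get(server, table, key)
```

In this sketch the client reads ROOT only when its remembered META location fails, which mirrors the order of fallbacks in the text: cache first, then META, then ROOT.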
When clients write data to HBASE tables, each write is first recorded in a write-ahead
log on HDFS and then held in memory. When the in-memory store becomes full, its contents
are flushed to a file on HDFS. The log is available to HBASE for crash recovery.
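
A minimal sketch of this write path follows, assuming a write-ahead log paired with an in-memory buffer that is flushed to a sorted file once it reaches a threshold. The class name ToyRegionStore, the threshold value, and the use of Python lists in place of files on HDFS are illustrative assumptions, not HBASE internals.

```python
class ToyRegionStore:
    """Hypothetical write path: log each write, buffer it, flush when full."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.wal = []            # stand-in for a log file on HDFS
        self.memstore = {}       # in-memory buffer of recent writes
        self.flushed_files = []  # each flush produces one sorted, immutable file

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # log first so the write survives a crash
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        self.flushed_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []  # logged writes are now durable in the flushed file

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for flushed in reversed(self.flushed_files):  # newest file first
            for key, value in flushed:
                if key == row_key:
                    return value
        return None

    def recover_after_crash(self):
        """Rebuild the in-memory buffer by replaying the log."""
        self.memstore = {}
        for row_key, value in self.wal:
            self.memstore[row_key] = value
```

The point of the design is that memory gives fast writes while the log covers the window between a write and the next flush. Anything lost from memory in a crash can be replayed from the log.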
HBASE is a powerful column-oriented datastore: a sparse, distributed, persistent,
multidimensional sorted map. It is a common database choice for Hadoop deployments
because it can hold the key-value outputs from MapReduce and other sources in
a scalable and flexible architecture.
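
To make the phrase "sparse, multidimensional sorted map" concrete, here is a small sketch in Python. It assumes cells are keyed by row, column, and timestamp. The class name ToySortedMap and the sample column names are hypothetical, and the sketch leaves out distribution and persistence entirely.

```python
class ToySortedMap:
    """Hypothetical model: (row, column, timestamp) -> value."""

    def __init__(self):
        # Sparse: only cells that were actually written take up space.
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it was never written."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self.cells.items()
            if r == row and c == column
        ]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def scan(self, start_row, stop_row):
        """Yield (row, column, newest value) for start_row <= row < stop_row,
        in sorted row and column order."""
        seen = set()
        # Sort by row, then column, then newest timestamp first.
        ordered = sorted(self.cells, key=lambda k: (k[0], k[1], -k[2]))
        for row, column, ts in ordered:
            if start_row <= row < stop_row and (row, column) not in seen:
                seen.add((row, column))
                yield row, column, self.cells[(row, column, ts)]
```

Because the keys are kept in sorted order, a range scan over adjacent row keys is a natural operation, and a row with only one column written costs only one cell.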
Hive
Hadoop and HDFS scale well because of the underlying architecture and the design of
the platform. While HBASE provides some database-like features, business users working
on Hadoop did not adopt the platform because it lacked SQL support or SQL-like features.
Hadoop cannot answer low-latency queries or perform deep analytical functions the way a
database can, but it holds large data sets that database infrastructure cannot process.
Those data sets need to be harnessed with a SQL-like language or infrastructure that
runs MapReduce internally. This is where Hive comes into play.
Hive is an open source data warehousing solution that has been built on top of Hadoop.
The fundamental goals of designing Hive are as follows:
• To build a system for managing and querying data using structured techniques on
  Hadoop
• Use native MapReduce for execution at the HDFS and Hadoop layers
• Use HDFS for storage of Hive data
• Store key metadata in an RDBMS
• Extend SQL interfaces: a familiar data warehousing tool in use at enterprises
• High extensibility: user-defined types, user-defined functions, formats, scripts
• Leverage extreme scalability and performance of Hadoop
• Interoperability with other platforms
Hive supports queries expressed in an SQL-like declarative language, HiveQL, which
are compiled into MapReduce jobs executed on Hadoop. Hive also includes a system
catalog, the metastore, which contains schemas and statistics and is used in data
exploration and query optimization.
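
To illustrate what compiling a query into a MapReduce job means, the sketch below mimics the map, shuffle, and reduce phases that a simple grouped count could be turned into. The query in the comment, the table and column names, and the function names are all hypothetical, and the code is not Hive's compiler or its generated plan. It only shows the general shape of such a job.

```python
from collections import defaultdict

# Hypothetical HiveQL query that this sketch mirrors:
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;


def map_phase(rows):
    """Emit one (group key, 1) pair per input row."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, which implements COUNT(*) per group."""
    return {key: sum(values) for key, values in sorted(groups.items())}


def run_group_by_count(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))
```

For example, calling run_group_by_count on three rows, two with page "/home" and one with page "/cart", returns a count of 2 for "/home" and 1 for "/cart". The user writes only the declarative query, and the system supplies the phases.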
Hive was originally conceived and developed at Facebook when the company's data
scalability needs outgrew traditional solutions. Hive has since been released as an
open source platform under the Apache Hadoop project.
Let us take a quick look at Hive architecture (Fig. 2.15).