


to become stale. When this happens, the client first queries the META table, and if the META table itself has moved to another location, the client traverses back to the ROOT table to get further information.
When clients write data to HBASE tables, the data is first processed in memory. When the memory becomes full, the data is flushed to a log file, which is stored on HDFS and is available to HBASE for use in crash recovery situations.
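To make the client write path concrete, here is a minimal sketch using the HBase Java client API. The table name ("users"), column family ("info"), and qualifier ("name") are illustrative assumptions rather than names from the text, and the cluster settings are assumed to come from an hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteExample {
    public static void main(String[] args) throws Exception {
        // Cluster and ZooKeeper settings are read from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical table "users" with a pre-created column family "info".
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // The Put is keyed by row key; the region server accepts it into its
            // in-memory store (and write-ahead log) before later flushing to HDFS.
            Put put = new Put(Bytes.toBytes("user-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
        }
    }
}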
HBASE is a powerful column-oriented data store, which is truly a sparse, distributed, persistent multidimensional sorted map. It is often the database of choice for Hadoop deployments, as it can hold the key-value outputs from MapReduce and other sources in a scalable and flexible architecture.
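Reading follows the same map-like addressing: a value is located by row key, column family, and column qualifier. The short sketch below continues the hypothetical users/info example used above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Address a single cell of the sorted map: (row key, family, qualifier).
            Get get = new Get(Bytes.toBytes("user-001"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}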

                 Hive
The scalability of Hadoop and HDFS is unparalleled, based on the underlying architecture and design of the platform. While HBASE provides some pseudo database features, business users were slow to adopt Hadoop because the platform lacked SQL support or SQL-like features. Although Hadoop cannot answer low-latency queries or run deep analytical functions the way a database can, it holds large data sets that cannot be processed by database infrastructure and that need to be harnessed with an SQL-like language or layer that runs MapReduce internally. This is where Hive comes into play.
Hive is an open source data warehousing solution built on top of Hadoop.
The fundamental goals of designing Hive are as follows:

- To build a system for managing and querying data using structured techniques on Hadoop
- Use native MapReduce for execution at the HDFS and Hadoop layers
- Use HDFS for storage of Hive data
- Store key metadata in an RDBMS
- Extend SQL interfaces, a familiar data warehousing tool in use at enterprises
- High extensibility: user-defined types, user-defined functions, formats, and scripts
- Leverage the extreme scalability and performance of Hadoop
- Interoperability with other platforms

Hive supports queries expressed in an SQL-like declarative language, HiveQL, which are compiled into MapReduce jobs executed on Hadoop. Hive also includes a system catalog, the metastore, which contains schemas and statistics used in data exploration and query optimization.
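As an illustration of how HiveQL reaches the engine from an application, the following is a minimal sketch using the Hive JDBC driver against HiveServer2. The connection URL, credentials, and the logs table are assumptions made for this example, not details from the text.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and default database.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // An aggregate like this is compiled by Hive into one or more
             // MapReduce jobs that scan files stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}

The same HiveQL can be submitted directly through the Hive CLI or Beeline; the JDBC route is shown only to emphasize that queries arrive as SQL and are executed as MapReduce jobs over data stored in HDFS.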
Hive was originally conceived and developed at Facebook, when Facebook's data scalability needs outgrew any traditional solution. Over the last few years, Hive has been released as an open source project under the Apache Hadoop umbrella. Let us take a quick look at the Hive architecture (Fig. 2.15).