The next section describes in detail the functionality of these components, broken down into infrastructure and execution.
Infrastructure
Metastore: The metastore is the system catalog that contains metadata about the tables stored in Hive. Metadata is specified during table creation and reused every time the table is used or referenced in HiveQL. In traditional database terms, the metastore is comparable to a system catalog. The metastore contains the following objects:
Database: The default namespace for tables. Users can create a database and name it. The database "default" is used for tables when the user does not supply a database name.
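As a minimal illustration (the database name sales_db is hypothetical), a named database can be created and set as the working namespace in HiveQL:

    -- Create a named database; tables created afterward belong to it
    CREATE DATABASE IF NOT EXISTS sales_db;
    USE sales_db;
    -- Without USE (or a database prefix), tables go to "default"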
Table: A Hive table is made up of the data stored in it and the associated metadata in the metastore.
In the physical implementation the data typically resides in HDFS, although it
may be in any Hadoop filesystem, including the local filesystem.
Metadata for a table typically contains the list of columns and their data types, the owner, user-supplied keys, and storage and SerDe information.
Storage information includes the location of the table's data in the filesystem, data formats, and bucketing information.
SerDe metadata includes the implementation class of serializer and deserializer
methods and any supporting information required by that implementation.
All this information can be specified during the initial creation of the table.
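As a sketch of how these pieces come together (the table name, columns, and location here are hypothetical), a single CREATE TABLE statement can declare the columns, SerDe, storage format, and filesystem location that the metastore records:

    -- Hypothetical table; the column list, SerDe class, storage
    -- format, and location are all captured as table metadata
    CREATE TABLE web_logs (
      ip      STRING,
      request STRING,
      status  INT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    STORED AS TEXTFILE
    LOCATION '/user/hive/warehouse/web_logs';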
Partition: To gain further performance and scalability, Hive organizes tables into partitions.
A partition contains part of the data, based on the value of a partition column, for example, date or LatLong.
Tables or partitions can be further subdivided into buckets. A bucket is akin to a
subpartition. An example is to bucket a partition of customers by customer_id.
Each partition can have its own columns and SerDe and storage information.
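Putting the two together, here is a sketch (the customers table and the bucket count of 32 are hypothetical) of a table partitioned by date and bucketed by customer_id, as in the example above:

    -- Each dt value gets its own partition directory; within a
    -- partition, rows are hashed on customer_id into 32 buckets
    CREATE TABLE customers (
      customer_id BIGINT,
      name        STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS;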
Execution: How does Hive process queries?
A HiveQL statement is submitted via the CLI, the web UI, or an external client using the Thrift, ODBC, or JDBC API. The driver first passes the query to the compiler, where it goes through parsing, type checking, and semantic analysis using the metadata stored in the metastore. The compiler generates a logical plan that is then optimized through a simple rule-based optimizer. Finally, an optimized plan in the form of a DAG of MapReduce tasks and HDFS tasks is generated. The execution engine then executes these tasks in the order of their dependencies, using Hadoop.
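The plan the compiler produces can be inspected with Hive's EXPLAIN statement; a brief sketch, reusing the hypothetical customers table from above:

    -- Print the stages (MapReduce and HDFS tasks) of the
    -- optimized plan instead of executing the query
    EXPLAIN
    SELECT dt, COUNT(*)
    FROM customers
    GROUP BY dt;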