Page 57 - Building Big Data Applications

Chapter 2   Infrastructure and technology  51


                   The next section describes in detail the functionality of the components broken down
                 into infrastructure and execution.

                 Infrastructure

                   Metastore: The metastore is the system catalog that contains metadata about
                   the tables stored in Hive. Metadata is specified during table creation and reused
                   every time the table is referenced in HiveQL. The metastore is comparable to the
                   system catalog of a traditional database. The metastore contains the
                   following objects:
                   Database: A database is a namespace for tables. Users can create a database and
                   name it. The database “default” is used for tables when the user supplies no
                   database name.
                   Table: A Hive table is made up of the data stored in it and the associated
                   metadata held in the metastore.
                     In the physical implementation the data typically resides in HDFS, although it
                      may be in any Hadoop filesystem, including the local filesystem.
                     Metadata for a table typically contains the list of columns and their data types,
                      owner, user-supplied keys, storage, and SerDe information.
                     Storage information includes location of the table’s data in the filesystem, data
                      formats, and bucketing information.
                     SerDe metadata includes the implementation class of serializer and deserializer
                      methods and any supporting information required by that implementation.
                     All of this information can be specified when the table is first created.
                   Partition: To gain further performance and scalability, Hive organizes
                   tables into partitions.
                     A partition contains parts of the data, based on the value of a partition column,
                      for example date or LatLong.
                     Tables or partitions can be further subdivided into buckets. A bucket is akin to a
                      subpartition. An example is to bucket a partition of customers by customer_id.
                     Each partition can have its own columns, SerDe, and storage information.
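The metastore objects described above can be pictured as a small data structure. The sketch below is illustrative only and is not Hive's actual metastore schema: the class name TableMetadata, its field names, the example table, and the paths are all hypothetical. It assumes that a partition maps to a directory named column=value under the table location, and that a bucket is chosen by taking a non-negative integer key modulo the number of buckets. For brevity it keeps columns, SerDe, and storage at the table level, although each partition can carry its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TableMetadata:
    """Simplified stand-in for one table's metastore entry (not Hive's real schema)."""
    database: str
    name: str
    owner: str
    columns: List[Tuple[str, str]]        # (column name, data type)
    location: str                         # storage: where the data lives in the filesystem
    data_format: str                      # storage: file format
    serde_class: str                      # SerDe: implementation class name
    partition_columns: List[str] = field(default_factory=list)
    bucket_columns: List[str] = field(default_factory=list)
    num_buckets: int = 0                  # 0 means the table is not bucketed
    serde_params: Dict[str, str] = field(default_factory=dict)  # supporting SerDe info

    def partition_path(self, **values: str) -> str:
        """Directory for one partition, built from column=value path segments."""
        parts = [f"{col}={values[col]}" for col in self.partition_columns]
        return "/".join([self.location.rstrip("/")] + parts)

    def bucket_for(self, key: int) -> int:
        """Bucket index for an integer key (assumed rule: non-negative key mod bucket count)."""
        if self.num_buckets <= 0:
            raise ValueError("table is not bucketed")
        return (key & 0x7FFFFFFF) % self.num_buckets


# Hypothetical table in the "default" database, partitioned by date
# and bucketed by customer_id, as in the example in the text.
customers = TableMetadata(
    database="default",
    name="customers",
    owner="analyst",
    columns=[("customer_id", "bigint"), ("name", "string")],
    location="/warehouse/customers",
    data_format="TEXTFILE",
    serde_class="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
    partition_columns=["signup_date"],
    bucket_columns=["customer_id"],
    num_buckets=4,
)

print(customers.partition_path(signup_date="2020-01-15"))
print(customers.bucket_for(1042))
```

With these made-up values, the partition path is /warehouse/customers/signup_date=2020-01-15 and customer 1042 falls into bucket 2. A query that filters on signup_date only needs to read the matching directory, which is where the performance gain from partitioning comes from.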


                 Execution: how does Hive process queries?

                 A HiveQL statement is submitted via the CLI, the web UI, or an external client using the
                 Thrift, ODBC, or JDBC API. The driver first passes the query to the compiler, where it goes
                 through parsing, type checking, and semantic analysis using the metadata stored in the
                 metastore. The compiler generates a logical plan that is then optimized by a simple
                 rule-based optimizer. Finally, an optimized plan in the form of a DAG of MapReduce
                 tasks and HDFS tasks is generated. The execution engine then executes these tasks in the
                 order of their dependencies, using Hadoop.
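Executing a DAG of tasks "in the order of their dependencies" amounts to a topological ordering of the plan. The sketch below shows the idea with Python's standard graphlib module. It is not Hive's execution engine: the task names and the plan itself are hypothetical, and a real engine would also run independent tasks in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "move_result": {"aggregate"},   # stands in for a final HDFS task
}


def execution_order(dag):
    """Return the tasks in an order where every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
for task in order:
    print("run", task)
```

The two scan tasks have no dependencies, so either may come first. The join always runs after both scans, the aggregate after the join, and move_result last.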