We can further analyze this processing workflow as follows:
The Hive client triggers a query
The compiler receives the query and connects to the metastore
The compiler then initiates the first phase of compilation
Parser: Converts the query into a parse tree representation. Hive uses ANTLR to
generate the abstract syntax tree (AST)
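To make the first step concrete, here is a minimal sketch of a Java client submitting a HiveQL query through the HiveServer2 JDBC driver. The connection URL, credentials, and the sales table are placeholders chosen for illustration and assume a locally running HiveServer2 instance.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveClientSketch {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (optional on JDBC 4+ classpaths).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder URL and credentials; assumes HiveServer2 on localhost:10000.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement()) {
            // "sales" is a hypothetical table used only for illustration.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, COUNT(*) FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}

Submitting the query through the Statement is what kicks off the parse, compile, optimize, and execute pipeline described in the rest of this walkthrough.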
Semantic Analyzer: In this stage the compiler builds a logical plan based on the
information provided by the metastore about the input and output tables.
Additionally, the compiler checks type compatibilities in expressions and flags
compile-time semantic errors at this stage. The next step is the transformation
of the AST into an intermediate representation called the query block (QB) tree.
Nested queries are converted into parent-child relationships in the QB tree
during this stage
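As a loose illustration of that parent-child relationship, the toy sketch below models a nested query as a small tree of query blocks. The QueryBlock class, the aliases, and the sales table are invented for this example; Hive's internal QB representation carries far more state.

import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a query block; illustrative only.
class QueryBlock {
    final String alias;
    final String body;
    final List<QueryBlock> children = new ArrayList<>();
    QueryBlock(String alias, String body) { this.alias = alias; this.body = body; }
}

public class QbTreeSketch {
    public static void main(String[] args) {
        // SELECT t.region, SUM(t.amount) FROM
        //   (SELECT region, amount FROM sales WHERE year = 2020) t
        // GROUP BY t.region
        QueryBlock outer = new QueryBlock("outer",
                "SELECT t.region, SUM(t.amount) ... GROUP BY t.region");
        QueryBlock inner = new QueryBlock("t",
                "SELECT region, amount FROM sales WHERE year = 2020");
        outer.children.add(inner);   // the nested query becomes a child QB of the outer QB
        System.out.println(outer.alias + " -> child QB: " + outer.children.get(0).alias);
    }
}

The inner SELECT becomes a child block of the outer block, which mirrors the parent-child relationship described above.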
Logical Plan Generator: In this stage the compiler writes the logical plan from
the semantic analyzer into a logical tree of operations
Optimization: This is the most involved phase of the compiler, as the entire
series of DAG optimizations is implemented in this phase. There are several
customizations that can be made to the compiler if desired. The primary
operations performed at this stage are as follows (a simplified sketch of
predicate pushdown follows this list):
- Logical optimization: Performs multiple passes over the logical plan and
rewrites it in several ways
- Column pruning: This optimization step ensures that only the columns needed
in query processing are actually projected out of the row
- Predicate pushdown: Predicates are pushed down to the scan if possible so
that rows can be filtered early in the processing
- Partition pruning: Predicates on partitioned columns are used to prune out
files of partitions that do not satisfy the predicate
- Join optimization
- Grouping and regrouping
- Repartitioning
- Physical plan generator converts the logical plan into a physical plan
- Physical plan generation creates the final DAG workflow of MapReduce jobs
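The sketch below pulls the last few steps together under heavily simplified assumptions: it builds a toy logical operator tree (scan, filter, join) and applies a predicate pushdown pass that moves a filter below a join and folds it into the matching scan. The node classes, the orders and customers tables, and the single-table predicate handling are invented for illustration and are far simpler than Hive's actual operator and optimizer code (the example uses Java 16+ pattern matching).

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative logical-plan nodes; not Hive's real operator classes.
interface PlanNode { String show(); }

final class Scan implements PlanNode {
    final String table;
    final List<String> pushedFilters = new ArrayList<>();   // predicates applied at scan time
    Scan(String table) { this.table = table; }
    public String show() { return "Scan(" + table + ", filters=" + pushedFilters + ")"; }
}

final class Filter implements PlanNode {
    final PlanNode child;
    final String predicate;   // predicate text, e.g. "amount > 100"
    final String table;       // the single table the predicate references (a simplification)
    Filter(PlanNode child, String predicate, String table) {
        this.child = child; this.predicate = predicate; this.table = table;
    }
    public String show() { return "Filter(" + predicate + ", " + child.show() + ")"; }
}

final class Join implements PlanNode {
    final PlanNode left, right;
    Join(PlanNode left, PlanNode right) { this.left = left; this.right = right; }
    public String show() { return "Join(" + left.show() + ", " + right.show() + ")"; }
}

public class PredicatePushdownSketch {
    // Push each Filter as close to the matching Scan as possible.
    static PlanNode pushDown(PlanNode n) {
        if (n instanceof Join join) {
            return new Join(pushDown(join.left), pushDown(join.right));
        }
        if (n instanceof Filter filter) {
            PlanNode child = pushDown(filter.child);
            if (child instanceof Scan scan && scan.table.equals(filter.table)) {
                scan.pushedFilters.add(filter.predicate);   // rows filtered during the scan
                return scan;
            }
            if (child instanceof Join childJoin) {
                // Route the filter to whichever join input reads its table.
                if (tables(childJoin.left).contains(filter.table)) {
                    return new Join(pushDown(new Filter(childJoin.left, filter.predicate, filter.table)),
                                    childJoin.right);
                }
                if (tables(childJoin.right).contains(filter.table)) {
                    return new Join(childJoin.left,
                                    pushDown(new Filter(childJoin.right, filter.predicate, filter.table)));
                }
            }
            return new Filter(child, filter.predicate, filter.table);   // cannot push further
        }
        return n;   // Scan: nothing to do
    }

    static Set<String> tables(PlanNode n) {
        Set<String> all = new HashSet<>();
        if (n instanceof Scan scan) all.add(scan.table);
        if (n instanceof Filter filter) all.addAll(tables(filter.child));
        if (n instanceof Join join) { all.addAll(tables(join.left)); all.addAll(tables(join.right)); }
        return all;
    }

    public static void main(String[] args) {
        // Roughly: SELECT ... FROM orders JOIN customers ON ... WHERE orders.amount > 100
        PlanNode plan = new Filter(new Join(new Scan("orders"), new Scan("customers")),
                                   "amount > 100", "orders");
        System.out.println("before: " + plan.show());
        System.out.println("after:  " + pushDown(plan).show());
    }
}

In the example the filter on orders ends up inside the scan of orders, so rows are discarded before the join; column pruning and partition pruning follow the same general idea of moving work as close to the scan as possible.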
The execution engine receives the compiler output and executes it on the Hadoop platform.
- All the tasks are executed in the order of their dependencies. Each task is
only executed if all of its prerequisites have been executed.
- A map/reduce task first serializes its part of the plan into a plan.xml file.
- This file is then added to the job cache for the task, and instances of
ExecMapper and ExecReducer are spawned using Hadoop.
- Each of these classes deserializes the plan.xml and executes the relevant part
of the task.
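The toy Hadoop mapper below gestures at this execution pattern: during setup it loads a small plan file that the driver shipped through the job cache, and then applies it to every input record. It is only a sketch of the idea; Hive's real ExecMapper deserializes the full operator plan from plan.xml and drives the entire operator pipeline. The plan.txt file name and the single filter token stand in for that plan and are assumptions of this example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A toy mapper that, in the spirit of ExecMapper, reads its piece of the "plan"
// from a file distributed through the job cache before processing any records.
public class PlanDrivenMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private String filterToken;   // stand-in for a real deserialized operator plan

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Assumes the driver called job.addCacheFile(new URI("hdfs:/.../plan.txt#plan.txt")),
        // which symlinks plan.txt into this task's working directory, analogous to the way
        // plan.xml reaches the task through the job cache.
        try (BufferedReader reader = new BufferedReader(new FileReader("plan.txt"))) {
            filterToken = reader.readLine();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "Execute the relevant part of the plan": here, just a filter on the input line.
        if (filterToken == null || value.toString().contains(filterToken)) {
            context.write(value, NullWritable.get());
        }
    }
}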