                We can further analyze this processing workflow as follows (illustrative HiveQL sketches follow the list):

               The Hive client submits a query
               The compiler receives the query and connects to the metastore
               The compiler then initiates the first phase of compilation
                   Parser - Converts the query into a parse tree representation. Hive uses ANTLR to generate the abstract syntax tree (AST)
                   Semantic Analyzer - In this stage the compiler builds a logical plan based on the information provided by the metastore about the input and output tables. The compiler also checks type compatibilities in expressions and flags compile-time semantic errors at this stage. The next step is the transformation of the AST into an intermediate representation called the query block (QB) tree. Nested queries are converted into parent-child relationships in the QB tree during this stage
                   Logical Plan Generator - In this stage the compiler converts the logical plan from the semantic analyzer into a logical tree of operations
                   Optimization - This is the most involved phase of the compiler, as the entire series of DAG optimizations is implemented in this phase. Several customizations can be made to the compiler if desired. The primary operations performed at this stage are as follows (the second sketch after this list shows the pruning and pushdown rewrites in action):
                    - Logical optimization - Performs multiple passes over the logical plan and rewrites it in several ways
                    - Column pruning - This optimization step ensures that only the columns that are needed in the query processing are actually projected out of the row
                    - Predicate pushdown - Predicates are pushed down to the scan if possible so that rows can be filtered early in the processing
                    - Partition pruning - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate
                   - Join optimization
                   - Grouping and regrouping
                   - Repartitioning
                    - Physical plan generation converts the logical plan into a physical plan and creates the final DAG workflow of MapReduce jobs
                   The execution engine receives the compiler output and executes it on the Hadoop platform (see the final sketch after this list).
                   - All the tasks are executed in the order of their dependencies. Each task is
                      only executed if all of its prerequisites have been executed.
                   - A map/reduce task first serializes its part of the plan into a plan.xml file.
                    - This file is then added to the job cache for the task, and instances of ExecMapper and ExecReducer are spawned using Hadoop.
                   - Each of these classes deserializes the plan.xml and executes the relevant part
                      of the task.
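
                A quick way to see the output of these compilation stages is Hive's
                EXPLAIN statement, which prints the stage dependencies and operator
                trees the compiler produces; EXPLAIN EXTENDED adds per-operator
                detail. The sketch below is illustrative only, and the table and
                column names (web_logs, status, request_date) are assumptions rather
                than objects defined in this chapter.

                    -- Illustrative sketch: inspect what the parser, semantic analyzer,
                    -- and plan generators produce for a query (assumed table/columns).
                    EXPLAIN
                    SELECT status, COUNT(*) AS hits
                    FROM web_logs
                    WHERE request_date = '2019-01-01'
                    GROUP BY status;

                    -- EXPLAIN EXTENDED prints additional per-operator detail
                    -- for the same plan.
                    EXPLAIN EXTENDED
                    SELECT status, COUNT(*) AS hits
                    FROM web_logs
                    WHERE request_date = '2019-01-01'
                    GROUP BY status;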
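
                The pruning and pushdown rewrites listed under the optimization phase
                depend on how tables are laid out and how predicates are written. The
                following sketch uses assumed table and column names (a web_logs table
                partitioned by request_date) to show a query shape that lets all three
                rewrites apply.

                    -- Illustrative sketch with assumed names: a table partitioned
                    -- by request_date.
                    CREATE TABLE web_logs (
                      ip     STRING,
                      url    STRING,
                      status INT
                    )
                    PARTITIONED BY (request_date STRING);

                    -- Column pruning: only ip and status are projected out of each row.
                    -- Predicate pushdown: status = 404 is applied close to the scan
                    --   (pushdown is governed by the hive.optimize.ppd setting).
                    -- Partition pruning: only the files of the 2019-01-01 partition
                    --   are read.
                    SELECT ip, status
                    FROM web_logs
                    WHERE request_date = '2019-01-01'
                      AND status = 404;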
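
                Once the physical plan is generated, the execution engine runs the
                resulting MapReduce stages in dependency order. As an illustration
                (again with assumed table names), a join followed by an aggregation
                typically compiles into more than one MapReduce stage; prefixing the
                query with EXPLAIN shows the STAGE DEPENDENCIES section that the
                execution engine follows.

                    -- Illustrative sketch: a join plus a group-by usually compiles into
                    -- a DAG of dependent MapReduce stages; each stage runs only after
                    -- its prerequisites complete. Table names (web_logs, users) are
                    -- assumptions for illustration.
                    EXPLAIN
                    SELECT u.country, COUNT(*) AS hits
                    FROM web_logs l
                    JOIN users u ON l.ip = u.ip
                    GROUP BY u.country;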