Page 59 - Building Big Data Applications

Chapter 2   Infrastructure and technology  53


                      - The final results are stored in a temporary location. When the entire
                         query completes, the results are either moved to the target table (for
                         inserts or partitions) or returned to the calling program from the
                         temporary location.
                   Comparing how Hive executes a query with how a traditional RDBMS does shows
                 that, because of the schema-on-read design, data placement, partitioning,
                 joining, and storage can be decided at execution time rather than during
                 planning cycles.
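
                 The schema-on-read idea can be illustrated with a minimal Python sketch. It is
                 not Hive code: the raw text, the column names, and the helper
                 read_with_schema are hypothetical, and mapping unparseable values to None is
                 only meant to resemble how Hive typically yields NULL for fields it cannot
                 interpret.

```python
import csv
import io

# Hypothetical raw data, stored as-is: nothing is validated when it is written.
RAW_LOG = "1,alice,34.5\n2,bob,not_a_number\n3,carol,12.0\n"

# The schema lives apart from the data and is applied only at read time.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Parse delimited text, casting each field when the data is read.

    Values that do not fit the declared type become None.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (column, cast), value in zip(schema, fields):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None
        rows.append(row)
    return rows


# The same stored bytes can be read under two different schemas.
typed_rows = read_with_schema(RAW_LOG, SCHEMA)
text_rows = read_with_schema(
    RAW_LOG, [("id", str), ("name", str), ("amount", str)]
)
```

                 Because the stored text never changes, a different schema can be applied
                 later without reloading the data, which is the flexibility the paragraph
                 above describes.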

                 Hive data types


                 Hive supports the following data types: tinyint, int, smallint, bigint, float,
                 boolean, string, and double. Special data types include Array, Map (key-value
                 pair), and Struct (collection of named fields).
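
                 As a rough analogy only, the special data types can be pictured with Python
                 structures. The Customer and Address classes and their field names are
                 hypothetical; the comments give the corresponding Hive type and access
                 syntax.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Address:
    # Analogue of a Hive STRUCT<street:STRING, city:STRING>
    street: str
    city: str


@dataclass
class Customer:
    id: int                 # INT
    name: str               # STRING
    active: bool            # BOOLEAN
    phones: List[str]       # ARRAY<STRING>
    attrs: Dict[str, str]   # MAP<STRING, STRING>
    address: Address        # STRUCT


customer = Customer(
    id=1,
    name="alice",
    active=True,
    phones=["555-0100", "555-0101"],
    attrs={"plan": "gold"},
    address=Address(street="1 Main St", city="Springfield"),
)

first_phone = customer.phones[0]   # Hive: phones[0]
plan = customer.attrs["plan"]      # Hive: attrs['plan']
city = customer.address.city       # Hive: address.city
```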

                 Hive query language (HiveQL)
                 The Hive query language (HiveQL) is an evolving language that supports much of
                 SQL's functionality on Hadoop, hiding the complexity of MapReduce from end users.
                   It supports traditional SQL features such as select, create table, insert,
                 “from clause” subqueries, various types of joins (inner, left outer, right
                 outer, and full outer), “group by” and aggregations, union all, create table
                 as select, and many useful functions.
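
                 Several of the features listed above can be tried without a Hadoop cluster.
                 The sketch below uses Python's built-in sqlite3 module as a stand-in for
                 Hive, so it shows the SQL semantics only. The table names and data are
                 hypothetical, and HiveQL syntax can differ in details.

```python
import sqlite3

# SQLite is a stand-in here; Hive would run comparable statements on Hadoop.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Left outer join with "group by" and aggregations: carol has no orders.
totals = conn.execute("""
    SELECT c.name, COUNT(o.amount), SUM(o.amount)
    FROM customers c
    LEFT OUTER JOIN orders o ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# "From clause" subquery: filter on an aggregate computed by the inner query.
big_spenders = conn.execute("""
    SELECT t.customer_id
    FROM (SELECT customer_id, SUM(amount) AS total
          FROM orders
          GROUP BY customer_id) t
    WHERE t.total > 8
""").fetchall()

# Create table as select, followed by union all (duplicate rows are kept).
conn.execute(
    "CREATE TABLE order_totals AS "
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
)
combined = conn.execute("""
    SELECT customer_id FROM order_totals
    UNION ALL
    SELECT customer_id FROM order_totals
""").fetchall()
```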

                 Hive examples


                 Count rows in a table:
                   SELECT COUNT(1) FROM table2;
                   SELECT COUNT(*) FROM table2;
                 Order By:
                   colOrder: (ASC | DESC)
                   orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
                   query: SELECT expression (',' expression)* FROM src orderBy
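
                 The counting and ordering examples can be checked with a small runnable
                 sketch. It again uses Python's sqlite3 module rather than Hive, and the
                 score column and the rows in table2 are hypothetical. COUNT(1) and
                 COUNT(*) both count every row, whereas counting a specific column skips
                 NULL values.

```python
import sqlite3

# SQLite stands in for Hive; table2's columns and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table2 (colName TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [("b", 2), ("a", 3), ("c", None), ("a", 1)],
)

# COUNT(1) and COUNT(*) count all rows, including rows containing NULLs.
count_one = conn.execute("SELECT COUNT(1) FROM table2").fetchone()[0]
count_star = conn.execute("SELECT COUNT(*) FROM table2").fetchone()[0]

# COUNT(column) ignores rows where that column is NULL.
count_score = conn.execute("SELECT COUNT(score) FROM table2").fetchone()[0]

# ORDER BY with an explicit colOrder per column, matching the grammar above.
ordered = conn.execute(
    "SELECT colName, score FROM table2 ORDER BY colName ASC, score DESC"
).fetchall()
```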

                 Chukwa

                 Chukwa is an open source data collection system for monitoring large
                 distributed systems. Chukwa is built on top of the Hadoop distributed
                 filesystem (HDFS) and the MapReduce framework. It also includes a flexible and
                 powerful toolkit for displaying, monitoring, and analyzing results to make the
                 best use of the collected data.

                 Flume

                 Flume is a distributed, reliable, and available service for efficiently collecting,
                 aggregating, and moving large amounts of log data. It has a simple and flexible
                 architecture
                 based on streaming data flows. It is robust and fault tolerant with tunable reliability