mechanisms and many failover and recovery mechanisms. It uses a simple, extensible
data model that allows for online analytic applications.

             Oozie

Oozie is a workflow/coordination system for managing Apache Hadoop jobs. Oozie
workflow jobs are Directed Acyclic Graphs (DAGs) of actions, such as MapReduce jobs.
Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency)
and data availability. Oozie is integrated with the rest of the Hadoop stack and supports
several types of Hadoop jobs out of the box (Java MapReduce, Streaming MapReduce,
Pig, DistCp, etc.). Oozie is a scalable, reliable, and extensible system.
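   As a hedged sketch of how this looks in practice (assuming an Oozie server at
http://localhost:11000/oozie, a hypothetical workflow application already uploaded to
HDFS, and made-up cluster endpoints), a workflow job can be submitted and monitored
with the Oozie Java client:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL; replace with your own.
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // Create a job configuration and point it at a workflow.xml in HDFS.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode:8020/user/hadoop/apps/my-workflow"); // assumed path
        conf.setProperty("jobTracker", "jobtracker:8021");            // assumed endpoints
        conf.setProperty("nameNode", "hdfs://namenode:8020");

        // Submit and start the workflow job; Oozie returns a job id.
        String jobId = client.run(conf);
        System.out.println("Submitted workflow job: " + jobId);

        // Poll until the DAG of actions finishes.
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);
        }
        System.out.println("Final status: " + client.getJobInfo(jobId).getStatus());
    }
}
```

The same submission can also be done with the oozie command-line client; the Java client
is shown here only to make the job lifecycle (submit, run, poll for completion) explicit.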

             HCatalog

A new integrated metadata layer called HCatalog was added to the Hadoop
ecosystem recently (late 2011). It is currently built on top of the Hive metastore and
incorporates components from Hive DDL. HCatalog provides read and write interfaces
for Pig, MapReduce, and Hive in one integrated repository. With an integrated repository,
users can explore any data across Hadoop using the tools built on its platform.
   HCatalog’s abstraction presents users with a relational view of data in the Hadoop
Distributed File System (HDFS) and ensures that users need not worry about where or in
what format their data is stored. HCatalog currently supports reading and writing files in
any format for which a SerDe (serializer/deserializer) can be written. Out of the box,
HCatalog supports the RCFile, CSV, JSON, and SequenceFile formats. To use a custom
format, you must provide the InputFormat, OutputFormat, and SerDe, implemented just
as they would be elsewhere in the Hadoop ecosystem. (For further details on HCatalog,
please refer to the Apache Software Foundation page or Hortonworks.)
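   As a rough illustration (assuming the org.apache.hive.hcatalog packaging and a
hypothetical web_logs table registered in the default database), a MapReduce job can read
a table through HCatalog without hard-coding its HDFS location or file format:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatReadExample {

    // Map-only job: emit the first column of each HCatRecord as plain text.
    public static class FirstColumnMapper
            extends Mapper<WritableComparable, HCatRecord, NullWritable, Text> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            Object firstColumn = value.get(0);
            context.write(NullWritable.get(),
                    new Text(firstColumn == null ? "" : firstColumn.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcatalog-read-example");
        job.setJarByClass(HCatReadExample.class);

        // HCatalog looks up the table's location and storage format (RCFile,
        // CSV, JSON, SequenceFile, ...) in the Hive metastore, so no paths or
        // input formats are hard-coded here.
        HCatInputFormat.setInput(job, "default", "web_logs"); // assumed db and table
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(FirstColumnMapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point of the sketch is the driver: the database and table names are the only inputs, and
HCatalog resolves everything else from the shared metastore, which is exactly the "relational
view" abstraction described above.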

             Sqoop

As the Hadoop ecosystem evolves, we will find the need to integrate data from other existing
“enterprise” data platforms, including the data warehouse, metadata engines, enterprise
systems (ERP, SCM), and transactional systems. Not all of this data can be moved to
Hadoop, because its small volumes, low-latency requirements, and computation patterns
are not oriented toward Hadoop workloads. To provide a connection between Hadoop and
the RDBMS platforms, Sqoop has been developed as the connector. There are two versions,
Sqoop1 and Sqoop2; let us take a quick look at this technology.
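   As a point of reference before comparing the two versions, the hedged sketch below
(with a made-up MySQL database, table, and credentials, and assuming the Sqoop 1.x
client libraries on the classpath) drives a table import from Java through Sqoop's tool
entry point, mirroring the usual sqoop import command line:

```java
import org.apache.sqoop.Sqoop;

public class OrdersImport {
    public static void main(String[] args) {
        // Equivalent to running "sqoop import ..." on the command line.
        String[] sqoopArgs = new String[] {
                "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales", // assumed source database
                "--username", "etl_user",                       // made-up credentials
                "--password", "etl_password",
                "--table", "orders",                            // assumed source table
                "--target-dir", "/user/hadoop/sales/orders",    // HDFS destination directory
                "--num-mappers", "4"                            // parallel map tasks for the import
        };

        // Sqoop parses the arguments and runs the import as a MapReduce job,
        // copying the table rows from the RDBMS into HDFS.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}
```

The JDBC source is handled by a connector plugin, which is what the connector-based
architecture mentioned in the Sqoop1 design goals below refers to.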
   Sqoop1: In the first release of Sqoop, the design goals were very simple:
   •  Export/import data from enterprise data warehouses, relational databases, and
      NoSQL databases
   •  Connector-based architecture with plugins from vendors
   •  No metadata store