mechanisms and many failover and recovery mechanisms. It uses a simple extensible
data model that allows for online analytic application.
Oozie
Oozie is a workflow/coordination system to manage Apache Hadoop jobs. Oozie
workflow jobs are Directed Acyclic Graphs (DAGs) of actions, such as MapReduce jobs.
Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency)
and data availability. Oozie is integrated with the rest of the Hadoop stack supporting
several types of Hadoop jobs out of the box (Java MapReduce, Streaming MapReduce,
Pig, DistCp, etc.). Oozie is a scalable, reliable, and extensible system.
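To make the DAG of actions concrete, the following is a minimal sketch of an Oozie workflow.xml containing a single MapReduce action; the workflow name, the mapper and reducer classes, and the ${jobTracker} and ${nameNode} parameters are hypothetical placeholders that would normally be supplied through the job properties file.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="mr-node"/>
  <!-- A single MapReduce action; real workflows chain many actions into a DAG -->
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.example.SampleMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.example.SampleReducer</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>

A coordinator job would then reference this workflow and declare the frequency and input datasets that trigger each run.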
HCatalog
A new integrated metadata layer called HCatalog was added to the Hadoop ecosystem recently (late 2011). It is currently built on top of the Hive metastore and incorporates components of Hive's DDL. HCatalog provides read and write interfaces for Pig, MapReduce, and Hive in one integrated repository. With this integrated repository, users can explore any data across Hadoop using the tools built on the platform.
HCatalog’s abstraction presents users with a relational view of data in the Hadoop
distributed filesystem (HDFS) and ensures that users need not worry about where or in
what format their data is stored. HCatalog currently supports reading and writing files in
any format for which a SerDe can be written. By default, HCatalog supports the RCFile, CSV, JSON, and SequenceFile formats out of the box. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe, just as you would elsewhere in the Hadoop ecosystem. (For further details on HCatalog, see the Apache Software Foundation or Hortonworks documentation.)
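As a sketch of the integrated read/write interfaces, a Pig script can load and store HCatalog-registered tables by name rather than by HDFS path and file format. The database and table names below are hypothetical, and the loader/storer package was org.apache.hcatalog.pig in the early releases (it later moved to org.apache.hive.hcatalog.pig after HCatalog merged into Hive).

-- read a table registered in HCatalog; its schema and storage format come from the metastore
raw_orders = LOAD 'sales.raw_orders' USING org.apache.hcatalog.pig.HCatLoader();
-- keep only rows with a positive total (column name is hypothetical)
valid_orders = FILTER raw_orders BY order_total > 0;
-- write the result to another HCatalog table without specifying paths or file formats
STORE valid_orders INTO 'sales.clean_orders' USING org.apache.hcatalog.pig.HCatStorer();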
Sqoop
As the Hadoop ecosystem evolves, we will find the need to integrate data from other existing "enterprise" data platforms, including the data warehouse, metadata engines, enterprise systems (ERP, SCM), and transactional systems. Not all of this data can be moved to Hadoop, because its small volumes, low-latency requirements, and styles of computation are not oriented toward Hadoop workloads. To provide a connection between Hadoop and the RDBMS platforms, Sqoop has been developed as the connector. There are two versions, Sqoop1 and Sqoop2; let us take a quick look at this technology.
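As a sketch of how such a connection is used in practice, the Sqoop1 command below imports a relational table into HDFS over JDBC using parallel map tasks; the connection string, credentials, table name, and target directory are hypothetical placeholders.

# import the "orders" table from a MySQL database into HDFS with four parallel mappers
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /warehouse/staging/orders \
  --num-mappers 4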
Sqoop1: In the first release of Sqoop, the design goals were very simple:
- Export/import data from enterprise data warehouses, relational databases, and NoSQL databases
- Connector-based architecture with plugins from vendors
- No metadata store