
Chapter 6.3: Introduction to Data Vault Architecture

How Does NoSQL Fit into the Architecture?



NoSQL platform implementations vary. Some offer SQL-like interfaces; some integrate relational database technology with nonrelational technology. The line between the two (RDBMS and NoSQL) will continue to blur. Eventually, the result will be a "data management system" capable of housing both relational and nonrelational data by design.


The NoSQL platform today is, in most cases, based on Hadoop at its core, which is built on the Hadoop Distributed File System (HDFS) and its metadata management for files in the various directories. Various implementations of SQL access layers and in-memory technologies sit on top of HDFS.
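
As one illustration, here is a minimal sketch of such an access layer, assuming Spark SQL on top of HDFS; the path, view name, and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-sql-layer").getOrCreate()

# Register a Parquet data set already resident in HDFS as a queryable view.
events = spark.read.parquet("hdfs://namenode:8020/warehouse/staging/events")
events.createOrReplaceTempView("events")

# Ad hoc SQL now runs directly against HDFS-resident data.
daily = spark.sql(
    "SELECT event_date, COUNT(*) AS event_count "
    "FROM events GROUP BY event_date"
)
daily.show()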

Once atomicity, consistency, isolation, and durability (ACID) compliance is achieved (it is available today from some NoSQL vendors), the differentiation between RDBMS and NoSQL will fade. Note that not all Hadoop or NoSQL platforms offer ACID compliance today, and not all NoSQL platforms support updating records in place, which makes it impossible for them to completely supplant RDBMS technology.


This is changing quickly. Even as this section is being written, the technology continues to advance. Eventually, the technology will be seamless, and what vendors in this space sell will be hybrid platforms.


The current positioning of a platform like Hadoop is to leverage it as an ingestion and staging area for any and all data that might proceed to the warehouse. This includes structured data sets (delimited files and fixed-width columnar files); multistructured data sets such as XML and JSON files; and unstructured data such as Word documents, Excel workbooks, video, audio, and images.
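
To make the staging idea concrete, the sketch below (with hypothetical paths, and Spark assumed as the processing layer) reads a delimited file and a JSON file out of the same HDFS staging area:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staging-reads").getOrCreate()

# Structured: a pipe-delimited file with a header row.
orders = (spark.read
          .option("header", True)
          .option("delimiter", "|")
          .csv("hdfs://namenode:8020/staging/orders.psv"))

# Multistructured: JSON documents, with the nested schema inferred.
clicks = spark.read.json("hdfs://namenode:8020/staging/clicks.json")

orders.printSchema()
clicks.printSchema()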


The reason is that ingesting a file into Hadoop is quite simple: copy the file into a directory that is managed by Hadoop. From that point, Hadoop splits the file across the multiple nodes, or machines, that it has registered as part of its cluster.
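
That copy step can be as little as one command against the standard hdfs dfs command-line interface; the following sketch wraps it in Python, with hypothetical local and HDFS paths:

import subprocess

staging_dir = "/staging/landing/orders"

# Create the target directory, then copy the local file into it.
# From here, Hadoop splits the file into blocks across the cluster's nodes.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", staging_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "orders.psv", staging_dir], check=True)

# Confirm the file landed in the Hadoop-managed directory.
subprocess.run(["hdfs", "dfs", "-ls", staging_dir], check=True)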


The second purpose for Hadoop (a best practice today) is to leverage it as a place to perform data mining, using SAS, R, or textual mining tools. The results of these mining efforts are often structured data sets that can and should be copied into relational database engines, making them available for ad hoc querying.
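
As a hedged sketch of that last hop, the code below uses Spark's JDBC writer to copy a structured mining result from HDFS into a relational engine; the connection URL, table name, and credentials are placeholders, and the matching JDBC driver must be on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mining-to-rdbms").getOrCreate()

# Structured output of a mining job, previously written back to HDFS.
scores = spark.read.parquet("hdfs://namenode:8020/mining/churn_scores")

# Append the result set to an RDBMS table for ad hoc querying.
(scores.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
    .option("dbtable", "analytics.churn_scores")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("append")
    .save())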



