
Chapter 6.3: Introduction to Data Vault Architecture

How Does NoSQL Fit into the Architecture?



NoSQL platform implementations vary. Some offer SQL-like interfaces; some integrate relational database technology with nonrelational technology. The line between the two (RDBMS and NoSQL) will continue to blur. Eventually, the result will be a "data management system" capable of housing both relational and nonrelational data by design.


The NoSQL platform today is, in most cases, based on Hadoop at its core, which is built on the Hadoop Distributed File System (HDFS) and its metadata management for files in the various directories. Various implementations of SQL access layers and in-memory technologies sit on top of HDFS.
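
As one illustration, here is a minimal sketch of such an access layer, assuming Spark SQL on top of HDFS; the path, view name, and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-sql-layer").getOrCreate()

# Register a Parquet data set already resident in HDFS as a queryable view.
events = spark.read.parquet("hdfs://namenode:8020/warehouse/staging/events")
events.createOrReplaceTempView("events")

# Ad hoc SQL now runs directly against HDFS-resident data.
daily = spark.sql(
    "SELECT event_date, COUNT(*) AS event_count "
    "FROM events GROUP BY event_date"
)
daily.show()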

Once atomicity, consistency, isolation, and durability (ACID) compliance is achieved (it is available today from some NoSQL vendors), the differentiation between RDBMS and NoSQL will fade. Note that not all Hadoop or NoSQL platforms offer ACID compliance today, and not all NoSQL platforms support updating records in place, which makes it impossible for them to completely supplant RDBMS technology.


This is changing quickly. Even as this section is being written, the technology continues to advance. Eventually, the technology will be seamless, and what vendors in this space sell will be hybrid platforms.


The current positioning of a platform like Hadoop is to leverage it as an ingestion and staging area for any and all data that might proceed to the warehouse. This includes structured data sets (delimited files and fixed-width columnar files); multistructured data sets such as XML and JSON files; and unstructured data such as Word documents, Excel workbooks, video, audio, and images.
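
To make the staging idea concrete, the sketch below (with hypothetical paths, and Spark assumed as the processing layer) reads a delimited file and a JSON file out of the same HDFS staging area:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staging-reads").getOrCreate()

# Structured: a pipe-delimited file with a header row.
orders = (spark.read
          .option("header", True)
          .option("delimiter", "|")
          .csv("hdfs://namenode:8020/staging/orders.psv"))

# Multistructured: JSON documents, with the nested schema inferred.
clicks = spark.read.json("hdfs://namenode:8020/staging/clicks.json")

orders.printSchema()
clicks.printSchema()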


The reason is that ingesting a file into Hadoop is quite simple: copy the file into a directory that is managed by Hadoop. From that point, Hadoop splits the file across the multiple nodes, or machines, that it has registered as part of its cluster.
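
That copy step can be as little as one command against the standard hdfs dfs command-line interface; the following sketch wraps it in Python, with hypothetical local and HDFS paths:

import subprocess

staging_dir = "/staging/landing/orders"

# Create the target directory, then copy the local file into it.
# From here, Hadoop splits the file into blocks across the cluster's nodes.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", staging_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "orders.psv", staging_dir], check=True)

# Confirm the file landed in the Hadoop-managed directory.
subprocess.run(["hdfs", "dfs", "-ls", staging_dir], check=True)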


The second purpose for Hadoop (a best practice today) is to leverage it as a place to perform data mining, using SAS, R, or textual mining tools. The results of these mining efforts are often structured data sets that can and should be copied into relational database engines, making them available for ad hoc querying.
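
As a hedged sketch of that last hop, the code below uses Spark's JDBC writer to copy a structured mining result from HDFS into a relational engine; the connection URL, table name, and credentials are placeholders, and the matching JDBC driver must be on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mining-to-rdbms").getOrCreate()

# Structured output of a mining job, previously written back to HDFS.
scores = spark.read.parquet("hdfs://namenode:8020/mining/churn_scores")

# Append the result set to an RDBMS table for ad hoc querying.
(scores.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
    .option("dbtable", "analytics.churn_scores")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("append")
    .save())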



