Page 202 - Building Big Data Applications
the data, and, most important of all, the need to keep the islands of misfit toys from
recurring. In this respect, we have evolved the data swamp and data lake layers as
responses to these challenges. The ability to store data in raw format and defer modeling
it until the time of analysis, together with the compelling economics of cloud storage and
distributed file systems, has provided answers for managing the problem. The new infrastructure
model has evolved quickly and created many offerings for different kinds of enterprises
based on size, complexity, usage, and data (Fig. 11.5).
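Storing data raw and deferring its modeling until analysis time (often called schema-on-read) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not a method taken from the text: the event fields, the `read_with_schema` helper, and the in-memory list standing in for a raw storage zone are all hypothetical.

```python
import json

# Raw zone: records are kept exactly as they arrived, one JSON document per line.
# (Assumption: an in-memory list stands in for cloud or distributed file storage.)
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.90", "ts": "2020-01-05", "channel": "web"}',
    '{"user": "b02", "amount": "5", "ts": "2020-01-06"}',
    '{"user": "a17", "amount": "n/a", "ts": "2020-01-07", "channel": "store"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time; skip records that do not fit it."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, ValueError):
            continue  # the raw record itself stays in storage, untouched
    return rows


# Two analyses model the same raw data differently, each at its time of use.
spend_schema = {"user": str, "amount": float}
channel_schema = {"user": str, "channel": str}

spend_rows = read_with_schema(RAW_EVENTS, spend_schema)
channel_rows = read_with_schema(RAW_EVENTS, channel_schema)
total_spend = sum(row["amount"] for row in spend_rows)
```

The point of the design is that no record is rejected at load time; each analysis decides which fields it needs and which records qualify.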
Even as we have evolved the model of computing in the cloud for the enterprise and
successfully adopted the data lake model, the data warehouse is still needed for
corporate analytical computing, and it has expanded with the addition of data lake and
data swamp layers upstream and analytical data hubs downstream. This means we
need to manage the data journey from the swamp to the hub, and maintain all lineage,
traceability, and transformation logistics, which need to be available on demand.
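One way to picture lineage that is available on demand is a log in which every derived dataset records its layer, its transformation, and its parent. The sketch below is a minimal illustration under assumptions: the `LineageStep` and `LineageLog` classes, the dataset names, and the warehouse step placed between lake and hub are all hypothetical, and a real platform would persist this metadata in a catalog rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LineageStep:
    dataset: str            # name of the dataset produced by this step
    layer: str              # e.g. "swamp", "lake", "warehouse", "hub"
    transformation: str     # short description of what was done
    parent: Optional[str]   # dataset this one was derived from, if any


@dataclass
class LineageLog:
    steps: Dict[str, LineageStep] = field(default_factory=dict)

    def record(self, dataset, layer, transformation, parent=None):
        # Refuse to record a dataset whose source is not itself traceable.
        if parent is not None and parent not in self.steps:
            raise ValueError(f"unknown parent dataset: {parent}")
        self.steps[dataset] = LineageStep(dataset, layer, transformation, parent)

    def trace(self, dataset) -> List[LineageStep]:
        """Return the chain of steps from the original source to `dataset`."""
        chain = []
        current = dataset
        while current is not None:
            step = self.steps[current]
            chain.append(step)
            current = step.parent
        return list(reversed(chain))


log = LineageLog()
log.record("raw_orders", "swamp", "landed as received")
log.record("orders_clean", "lake", "deduplicated and typed", parent="raw_orders")
log.record("orders_fact", "warehouse", "conformed to dimensions", parent="orders_clean")
log.record("orders_by_region", "hub", "aggregated by region", parent="orders_fact")

journey = [step.layer for step in log.trace("orders_by_region")]
```

Calling `log.trace` on any dataset in the hub walks back through each transformation to the raw data in the swamp.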
The multiple layers need to coexist, and our tools and platforms need to be designed
and architected to accommodate this heterogeneity. That coexistence is not well
accommodated by the tools market, which is largely split along data warehouse and data
lake lines. Older, established tools that predate Hadoop and data lakes were designed
to work with relational database management systems. Newer tools that grew up in the
big data era are more focused on managing individual data files kept in cloud storage
systems like Amazon S3 or distributed file systems such as Hadoop’s HDFS. The
foundational issue is how to marry the two (Fig. 11.6).
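A rough sketch of what marrying the two can mean in practice is to bring relational rows and file-based rows into one common shape so they can be combined. Everything here is an assumption made for illustration: an in-memory SQLite database stands in for a relational warehouse, a local CSV file stands in for an object in S3 or HDFS, and the helper names, tables, and columns are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def rows_from_table(conn, table):
    """Read rows from a relational table as dictionaries."""
    # The table name is interpolated only because it is a fixed, trusted demo name.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


def rows_from_file(path):
    """Read rows from a delimited data file as dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Warehouse side: an in-memory SQLite database stands in for an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [("c1", "east"), ("c2", "west")])

# Lake side: a local CSV file stands in for an object in cloud storage.
lake_dir = tempfile.mkdtemp()
lake_path = os.path.join(lake_dir, "clicks.csv")
with open(lake_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerows([["customer_id", "clicks"], ["c1", "12"], ["c2", "7"], ["c1", "3"]])

# Once both sources share the same row shape, they can be joined directly.
region_of = {row["customer_id"]: row["region"] for row in rows_from_table(conn, "customers")}
clicks_by_region = {}
for row in rows_from_file(lake_path):
    region = region_of[row["customer_id"]]
    clicks_by_region[region] = clicks_by_region.get(region, 0) + int(row["clicks"])
```

The two reader functions hide where the data lives, so the joining logic does not care which side of the warehouse and lake divide a row came from.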
Enterprises do not want a broken tool chain; they want technologies that can straddle the
line and work with platforms on either side of it. They all have multiple sets of data
technologies, and the data in each must be leveraged together to benefit the enterprise. This
requires the different databases, data warehouses, data swamps, and other systems with all
FIGURE 11.5 Cloud computing.