Page 126 - Building Big Data Applications
Chapter 6 Visualization, storyboarding and applications
Data tagging is the process of creating an identifying link on the data for
metadata integration.
Data classification is the process of creating subsets of value pairs for data
processing and integration. An example of this is extracting website URLs from
clickstream data along with page-view information.
Data modeling is the process of creating a model for data visualization or
analytics. The output from this step can be combined into an extraction exercise.
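The tagging and classification steps above can be sketched in a few lines of Python. This is a minimal illustration, not production code; the record layout, field names, and source label are all assumed for the example:

```python
# Hypothetical clickstream records; field names are assumed for illustration.
raw_events = [
    {"ts": "2023-05-01T10:00:00", "url": "https://example.com/home", "views": 3},
    {"ts": "2023-05-01T10:05:00", "url": "https://example.com/cart", "views": 1},
]

def tag(event, source):
    """Data tagging: attach an identifying link used for metadata integration."""
    return {**event, "source": source}

def classify(events):
    """Data classification: extract (URL, page-view) value pairs for processing."""
    return [(e["url"], e["views"]) for e in events]

tagged = [tag(e, "web-clickstream") for e in raw_events]
pairs = classify(tagged)
print(pairs)  # [('https://example.com/home', 3), ('https://example.com/cart', 1)]
```

A real pipeline would apply the same two operations to streaming input rather than an in-memory list, but the shape of the work is the same.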
Once the data is prepared for analysis in the discovery stage, users can extract
result sets from any stage and use them for integration. These steps require a
combination of data analytics and statistical modeling skills, which is the role of a data
scientist. The question that confronts users today is how to do the data discovery:
do you develop MapReduce code extensively, or do you use software like Tableau or
Apache Presto? The answer to this question is simple: rather than developing extensive
lines of MapReduce code, which may not be reusable, you can adopt data
discovery and analysis tools that can generate the MapReduce code based on
the operations that you execute.
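To make the trade-off concrete, here is what a hand-written MapReduce-style aggregation over clickstream data looks like, sketched in plain Python. The record format ("timestamp,url" per line) is assumed; discovery tools generate the equivalent of this boilerplate from the operations you perform interactively:

```python
from collections import defaultdict

# Hypothetical clickstream lines; "timestamp,url" format is assumed.
lines = [
    "2023-05-01T10:00:00,/home",
    "2023-05-01T10:01:00,/cart",
    "2023-05-01T10:02:00,/home",
]

def map_phase(line):
    """Map: emit a (url, 1) pair for each page view."""
    _, url = line.split(",", 1)
    yield url, 1

def reduce_phase(pairs):
    """Reduce: sum the counts per url key."""
    counts = defaultdict(int)
    for url, n in pairs:
        counts[url] += n
    return dict(counts)

mapped = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(mapped))  # {'/home': 2, '/cart': 1}
```

Even this trivial page-view count requires job-specific parsing and key logic, which is exactly the kind of one-off code that tends not to be reusable across analyses.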
Whichever method you choose to architect the solution, your data
discovery framework is the key to developing big data analytics within your organization.
Once the data is ready for visualization, you can integrate the data with mash-ups and
other powerful visualization tools and provide the dashboards to the users.
Visualization
Big data visualization is not like traditional business intelligence, where the data is
interactive and can be processed as drill-downs and roll-ups in a hierarchy or can be
drilled into in a real-time fashion. This data is static in nature and will be minimally
interactive in a visualization situation. The underlying reason for this static nature is
the design of big data platforms like Hadoop or NoSQL, where the data is stored in
files and not in table structures, and processing changes requires massive file
operations, which are best performed in a microbatch environment as opposed to a
real-time environment. This limitation is being addressed in the next generation of
Hadoop and other big data platforms.
Today the data that is available for visualization is largely integrated using mash-up
tools and software that support such functionality, including Tableau and Spotfire. The
mash-up platform provides the capability for the user to integrate data from multiple
streams into one picture, by linking common data between the different datasets.
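At its core, the mash-up operation is a join of separate datasets on the data they share. A minimal sketch in plain Python, with hypothetical datasets keyed on the common (product, geography) fields:

```python
# Hypothetical datasets keyed by (product, geography); values are assumed.
sentiment = {("A", "US"): 0.7, ("B", "EU"): -0.2}
campaign = {("A", "US"): 1000, ("B", "EU"): 500}

# Mash-up: link the two streams on their common (product, geography) keys,
# keeping only keys present in both datasets (an inner join).
mashup = {
    key: {"sentiment": sentiment[key], "spend": campaign[key]}
    for key in sentiment.keys() & campaign.keys()
}
print(mashup[("A", "US")])  # {'sentiment': 0.7, 'spend': 1000}
```

Mash-up tools perform this linking visually, but the underlying mechanics are the same: identify common keys between datasets and join on them.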
For example, if you are looking at integrating customer sentiment analytics with
campaign data, field sales data, and competitive research data, the mash-up that will be
created to view all of this information will integrate the customer sentiment with the
campaign data using product and geography information, the competitive research data
with the campaign data using geography information, and the sales data with the campaign