is returned as personalized offers to the user. Often, sponsors of specific products and
services provide such offers with incentives, which are presented to the user as part of
the recommender algorithm's output.
How does machine learning use metadata and master data? In the search example we
discussed, metadata is derived for the search elements and tagged with additional data
as available. When the machine learning algorithm is executed, this data is compared and
processed against the data in the knowledge repository, which includes semantic libraries
and master data catalogs. The combination of metadata and master data, along with the
use of semantic libraries, provides better-quality data to the machine learning algorithm,
which in turn produces better-quality output for use by hypothesis and prediction
workflows.
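To make this concrete, the sketch below (Python, with hypothetical names and data; the text does not prescribe a specific implementation) shows how a raw search term might be tagged with metadata, master data, and semantic-library expansions before it reaches the learning step.

```python
# Illustrative sketch only: enrich a search term with metadata, master data,
# and semantic-library synonyms before the machine learning step consumes it.
# All names and values here (semantic_library, customer_master, etc.) are hypothetical.

semantic_library = {"sneakers": ["running shoes", "trainers", "athletic footwear"]}
customer_master = {"cust-001": {"segment": "premium", "region": "US-West"}}

def enrich_search_event(term: str, customer_id: str) -> dict:
    """Tag a raw search term with metadata, master data, and semantic expansions."""
    return {
        "term": term,
        "synonyms": semantic_library.get(term, []),        # semantic library
        "customer": customer_master.get(customer_id, {}),  # master data catalog
        "source": "web-search",                            # metadata about origin
    }

event = enrich_search_event("sneakers", "cust-001")
# The enriched event, rather than the bare term, is what the learning
# algorithm would consume, improving the quality of its output.
print(event)
```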
Processing data that is largely numeric, such as sensor data, financial data, or credit card
data, is based on the patterns of numbers that arrive as data inputs. These patterns are
processed through several mathematical models, and their outputs are stored in the
knowledge repository, which then shares the stored results back into the processing loop
of the machine learning implementation.
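A minimal sketch of that feedback loop, under assumed names and a toy anomaly model, might look as follows; the knowledge repository here is simply a list standing in for the shared store of prior results.

```python
# Minimal sketch (assumed names): numeric inputs are scored by a model, the
# results are stored in a knowledge repository, and the repository is fed back
# into the next processing pass.

knowledge_repository = []   # stands in for the shared store of prior results

def score(batch, history):
    """Toy model: flag readings that deviate strongly from the historical mean."""
    baseline = sum(history) / len(history) if history else 0.0
    return [x for x in batch if abs(x - baseline) > 10.0]

for batch in ([12.1, 35.0, 11.8], [11.9, 12.4, 40.2]):
    anomalies = score(batch, knowledge_repository)
    knowledge_repository.extend(batch)   # stored results re-enter the loop
    print(anomalies)
```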
Processing data such as images and videos uses conversion techniques to create
mathematical datasets for all the nontextual elements. These mathematical datasets are
processed through several combinations of data mining and machine learning algorithms,
including statistical analysis, linear regression, and polynomial curve-fitting techniques,
to create outputs. These outputs are processed further to create a noise-free set of
outputs, which can be used to recreate the digital models of the images or video data
(image only, not audio). Audio is processed as separate feeds and associated with the
video-processing datasets as needed.
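As an illustration of one technique named above, the following sketch uses NumPy's polynomial fitting to turn a noisy numeric series, of the kind derived from nontextual data, into a smoothed, noise-reduced output; the data is synthetic.

```python
# Synthetic example of polynomial curve fitting as a noise-reduction step.

import numpy as np

x = np.linspace(0, 1, 50)
noisy = 3 * x**2 - 2 * x + 0.5 + np.random.normal(scale=0.05, size=x.size)

coeffs = np.polyfit(x, noisy, deg=2)   # polynomial curve fitting
denoised = np.polyval(coeffs, x)       # reconstructed, noise-reduced output

print(coeffs)                          # approximately [3, -2, 0.5]
```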
Machine learning techniques reduce the complexity of processing big data. The most
common and popular algorithms for machine learning with web-scale data processing are
available in open source as the Apache Mahout project. Mahout is designed to be
deployed on Hadoop with minimal configuration effort and can scale very effectively.
While not all machine learning algorithms mandate an enterprise data scientist, this is
definitely the most complex area in the processing of large datasets, and having a team
of data scientists will be useful for any enterprise.
As we see from the discussions in this chapter, processing big data applications is
indeed a complex and challenging process. Since the room for error in this type of
processing is minimal, the quality of the data used for processing needs to be pristine.
This can be accomplished by implementing a data-driven architecture that uses all the
enterprise data assets available to create a powerful foundation for analysis and
integration of data across Big Data and the DBMS. This foundational architecture is what
defines the next generation of the data warehouse, where all types of data are stored and
processed to empower the enterprise to make and execute profitable decisions.