is returned as personalized offers to the user. Often, sponsors of specific products and
services provide such offers with incentives, which are presented to the user as part of
the recommender algorithm's output.
How does machine learning use metadata and master data? In the search example we
discussed, metadata is derived for the search elements and tagged with additional data
as available. When the machine learning algorithm is executed, this data is compared and
processed against the data in the knowledge repository, which includes semantic libraries
and master data catalogs. The combination of metadata and master data, along with the
use of semantic libraries, provides better-quality data to the machine learning algorithm,
which in turn produces better-quality output for use by hypothesis and prediction
workflows.
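To make this concrete, the sketch below (Python, with hypothetical names and data; the text does not prescribe a specific implementation) shows how a raw search term might be tagged with metadata, master data, and semantic-library expansions before it reaches the learning step.

```python
# Illustrative sketch only: enrich a search term with metadata, master data,
# and semantic-library synonyms before the machine learning step consumes it.
# All names and values here (semantic_library, customer_master, etc.) are hypothetical.

semantic_library = {"sneakers": ["running shoes", "trainers", "athletic footwear"]}
customer_master = {"cust-001": {"segment": "premium", "region": "US-West"}}

def enrich_search_event(term: str, customer_id: str) -> dict:
    """Tag a raw search term with metadata, master data, and semantic expansions."""
    return {
        "term": term,
        "synonyms": semantic_library.get(term, []),        # semantic library
        "customer": customer_master.get(customer_id, {}),  # master data catalog
        "source": "web-search",                            # metadata about origin
    }

event = enrich_search_event("sneakers", "cust-001")
# The enriched event, rather than the bare term, is what the learning
# algorithm would consume, improving the quality of its output.
print(event)
```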
Processing data that is largely numeric, such as sensor data, financial data, or credit card
data, is based on the patterns of numbers that arrive as data inputs. These patterns are
processed through several mathematical models, and their outputs are stored in the
knowledge repository, which then shares the stored results back into the processing loop
of the machine learning implementation.
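A minimal sketch of that feedback loop, under assumed names and a toy anomaly model, might look as follows; the knowledge repository here is simply a list standing in for the shared store of prior results.

```python
# Minimal sketch (assumed names): numeric inputs are scored by a model, the
# results are stored in a knowledge repository, and the repository is fed back
# into the next processing pass.

knowledge_repository = []   # stands in for the shared store of prior results

def score(batch, history):
    """Toy model: flag readings that deviate strongly from the historical mean."""
    baseline = sum(history) / len(history) if history else 0.0
    return [x for x in batch if abs(x - baseline) > 10.0]

for batch in ([12.1, 35.0, 11.8], [11.9, 12.4, 40.2]):
    anomalies = score(batch, knowledge_repository)
    knowledge_repository.extend(batch)   # stored results re-enter the loop
    print(anomalies)
```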
Processing data such as images and videos uses conversion techniques to create
mathematical datasets for all the nontextual elements. These mathematical datasets are
processed through several combinations of data mining and machine learning algorithms,
including statistical analysis, linear regression, and polynomial curve-fitting techniques,
to create outputs. These outputs are processed further to create a noise-free set of
outputs, which can be used to recreate the digital models of the images or video data
(image only, not audio). Audio is processed as separate feeds and associated with the
video-processing datasets as needed.
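As an illustration of one technique named above, the following sketch uses NumPy's polynomial fitting to turn a noisy numeric series, of the kind derived from nontextual data, into a smoothed, noise-reduced output; the data is synthetic.

```python
# Synthetic example of polynomial curve fitting as a noise-reduction step.

import numpy as np

x = np.linspace(0, 1, 50)
noisy = 3 * x**2 - 2 * x + 0.5 + np.random.normal(scale=0.05, size=x.size)

coeffs = np.polyfit(x, noisy, deg=2)   # polynomial curve fitting
denoised = np.polyval(coeffs, x)       # reconstructed, noise-reduced output

print(coeffs)                          # approximately [3, -2, 0.5]
```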
Machine learning techniques reduce the complexity of processing big data. The most
common and popular algorithms for machine learning with web-scale data processing are
available in open source as the Apache Mahout project. Mahout is designed to be
deployed on Hadoop with minimal configuration effort and can scale very effectively.
While not all machine learning algorithms mandate an enterprise data scientist, this is
definitely the most complex area in the processing of large datasets, and having a team
of data scientists will be useful for any enterprise.
As we see from the discussions in this chapter, processing big data applications is
indeed a complex and challenging process. Since the room for error in this type of
processing is minimal, the quality of the data used for processing needs to be pristine.
This can be accomplished by implementing a data-driven architecture that uses all the
enterprise data assets available to create a powerful foundation for analysis and
integration of data across Big Data and the DBMS. This foundational architecture is what
defines the next generation of the data warehouse, where all types of data are stored and
processed to empower the enterprise to make and execute profitable decisions.