Page 152 - Intelligent Digital Oil And Gas Fields
Components of Artificial Intelligence and Data Analytics 115
Apache Hadoop (Ghemawat et al., 2003; Handy, 2015) or NoSQL
(Pokorny, 2011), distributed on platforms such as Cloudera and Hortonworks,
with processing engines such as MapReduce (Dean and Ghemawat, 2008) or
Apache Spark.
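To make the MapReduce programming model mentioned above more concrete, the following is a minimal single-process sketch in Python. It is not how Hadoop or Spark are invoked; it only imitates the map, shuffle, and reduce steps in memory. The well identifiers, the alarm log, and all function names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit one (key, value) pair per input record."""
    well_id, _event = record
    return (well_id, 1)

def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce step: collapse the values of one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = [map_record(r) for r in records]
    groups = shuffle(pairs)
    return dict(reduce_group(k, v) for k, v in groups.items())

# Hypothetical alarm log: (well identifier, alarm type).
alarm_log = [
    ("W-01", "high_pressure"),
    ("W-02", "sand"),
    ("W-01", "esp_trip"),
    ("W-03", "high_pressure"),
    ("W-01", "sand"),
]
alarms_per_well = map_reduce(alarm_log)
```

In a distributed setting the map and reduce functions would run on many nodes and the shuffle would move data across the network; the sketch keeps only the logical structure.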
Recently, an overwhelming amount of literature has been published
about Big Data concepts. Two publications that we recommend are
“Harness the Power of Big Data” by Zikopoulos et al. (2013) and “Harness
Oil and Gas Big Data with Analytics: Optimize Exploration and Production with
Data Driven Models” by Holdaway (2014).
4.2 INTELLIGENT DATA ANALYTICS AND VISUALIZATION
4.2.1 Data Mining
Data mining (DM) is the discovery of knowledge from large quantities of data.
The process derives its name from the similarity between searching for
valuable business information in a large database, containing terabytes or even
petabytes of data, and mining a mountain for a vein of valuable ore.
Technically, the term refers to the process of extracting useful models and
patterns that are (Leskovec et al., 2014):
• valid (i.e., hold on new data with some certainty),
• useful (i.e., add value and enable people to take related actions),
• unexpected (i.e., nonobvious and nonintuitive, spurring the “aha!”
moment), and
• understandable (i.e., humans should be able to interpret and analyze them).
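The validity criterion in the list above can be read as a requirement that a mined pattern keeps holding on data that were not used to find it. The sketch below illustrates this with a holdout check. The data are synthetic, the pattern is a simple threshold rule, and the acceptance limits (0.8 accuracy, 0.05 gap) are arbitrary assumptions made for illustration, not values taken from Leskovec et al. (2014).

```python
def accuracy(samples, threshold):
    """Share of samples where the rule 'x >= threshold' matches the label."""
    hits = sum((x >= threshold) == label for x, label in samples)
    return hits / len(samples)

def mine_threshold(samples):
    """Pick the candidate threshold with the highest accuracy on the samples."""
    candidates = sorted({x for x, _ in samples})
    return max(candidates, key=lambda t: accuracy(samples, t))

# Synthetic, illustrative data: the label is True when x >= 60, with every
# seventh record deliberately mislabeled to imitate noise.
records = [(x, (x >= 60) != (x % 7 == 0)) for x in range(100)]
training = records[0::2]  # used to find the pattern
holdout = records[1::2]   # "new" data the pattern has never seen

threshold = mine_threshold(training)
train_acc = accuracy(training, threshold)
holdout_acc = accuracy(holdout, threshold)

# Assumed acceptance rule: the pattern counts as valid if it is still accurate
# on the holdout set and its accuracy does not drop much relative to training.
is_valid = holdout_acc >= 0.8 and abs(train_acc - holdout_acc) <= 0.05
```

A pattern that scores well on the training records but collapses on the holdout records would fail this check, which is the practical meaning of "with some certainty."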
Data mining as a discipline overlaps with database systems, statistics, and
ML. The resulting complexity of dealing with data in data mining
applications is represented graphically in Fig. 4.6.
Data come in a variety of modalities, formats, and ontologies (from
structured to unstructured, static to streaming, descriptive to Boolean), which
implies that for successful data mining, the data need to be properly collected,
stored, and managed. Ideally, these tasks would be performed continuously
by the data operators; in reality, however (as is frequently the case in the E&P
industry), the data presented for mining are imperfect, with missing, illogical,
and nonphysical values that require extensive QA/QC processing, including
interpolation and imputation of missing data (van Buuren and
Groothuis-Oudshoorn, 2011).
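As a minimal sketch of the kind of QA/QC step described above, the following Python code flags nonphysical readings and fills interior gaps by linear interpolation. This is a deliberately simple stand-in and not the multiple-imputation approach of van Buuren and Groothuis-Oudshoorn (2011). The rate series, the physical bounds, and the function names are hypothetical.

```python
def flag_nonphysical(values, lower, upper):
    """Replace readings outside [lower, upper] with None (treated as missing)."""
    return [v if v is not None and lower <= v <= upper else None
            for v in values]

def interpolate_gaps(values):
    """Linearly interpolate interior gaps; leading/trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over consecutive pairs of known points and fill what lies between.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled

# Hypothetical daily oil rates: one missing value, one sentinel (-9999.0),
# and one obviously nonphysical spike.
daily_oil_rate = [510.0, 505.0, None, -9999.0, 490.0, 1.0e9, 480.0]

# Assumed physical bounds for this hypothetical well.
screened = flag_nonphysical(daily_oil_rate, lower=0.0, upper=5000.0)
cleaned = interpolate_gaps(screened)
```

Linear interpolation is only reasonable for short gaps in smoothly varying series; longer gaps or correlated multivariate data generally call for model-based imputation.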
Historically, statisticians were the first to use the term “data mining,”
ironically to describe attempts to extract information that
was not supported by the data. However, with the evolution of statistical