Page 358 - From Smart Grid to Internet of Energy
The kFold parameter denotes the list of RDD pairs holding the training and test data for each of the k folds. The filter operation produces a new RDD containing only the elements that satisfy a given predicate, while map applies a transformation to each element of an RDD. The zipWithIndex operation pairs each RDD element with its index. The ordering is based first on the partition index and then on the element order within each partition; thus, the first element of the first partition is assigned index 0 and the last element of the last partition receives the highest index.
Random forests are a popular family of classification and regression methods in the Spark APIs. As ensembles of decision trees, they combine many trees to reduce the risk of overfitting. The block diagram shown in Fig. 8.5 was proposed by Garcia-Gil et al.; each partition is iterated to train a random forest model, and the learned model is then used to predict the test data. Once the test data and the predicted data are available for comparing classes, the zipWithIndex operation is applied to each RDD. Afterwards, the map function is applied to the actual and predicted classes of each RDD. Any instance whose predicted class differs from its actual class is treated as noise and removed by the filter function [19].
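The pairing, comparison, and removal steps just described can be sketched in plain Python (without Spark); the instances and "predictions" below are made up for illustration:

```python
# Sketch of the filtering step: actual and predicted classes are paired
# per instance (the role played by zipWithIndex in Spark), instances
# whose classes disagree are marked as noise, and the filter step keeps
# only the clean ones.

data      = [("x1", "A"), ("x2", "B"), ("x3", "A"), ("x4", "B")]
predicted = ["A", "A", "A", "B"]      # e.g. output of a random forest model

# pair each instance's actual class with its predicted class
paired = [(inst, actual, pred)
          for (inst, actual), pred in zip(data, predicted)]

# filter: keep instances whose predicted class matches the actual one
clean = [inst for inst, actual, pred in paired if actual == pred]
noise = [inst for inst, actual, pred in paired if actual != pred]

print(clean)  # ['x1', 'x3', 'x4']
print(noise)  # ['x2']  -> removed as noise
```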
Another sample noise filtering method proposed in [19] is the heterogeneous ensemble (HTE-BD) method, which is based on three classifiers: random forest, logistic regression, and kNN. Unlike the heterogeneous ensemble algorithm, the homogeneous one (HME-BD) is based on the random forest classification algorithm alone. The additional learning algorithms used in HTE-BD, however, provide increased detection capability in noise filtering and data cleaning operations.
It is essentially an ensemble filter improved by an increased number of classification algorithms. The decision tree is one of the most widely used machine learning methods in data mining. A decision tree starts with a single node, and each outcome then generates a further node. Apache Spark optimizes the scalability of decision trees, since a tree grows over several nodes. A random forest is a combination of decision trees whose nodes are collected by algorithms called ensembles. Spark trains each tree of the random forest individually, so training a random forest becomes a distributed operation. The kNN algorithm, another component of the HTE-BD method proposed by Garcia-Gil et al., is a supervised learning algorithm for classification. The algorithm presented below improves on the ensemble filter by combining several learning algorithms, namely a decision tree, kNN, and a linear machine learning method. The authors suggest using the Spark random forest model instead of a pure decision tree algorithm, and logistic regression is used as the linear machine learning method. Every training and test operation on the input data is run by the three algorithms, as seen in Fig. 8.6, for each fold from the first to the kth. The trained algorithms predict the test data and create RDD triples (rf, lr, knn), which are compared with the original classes. The required input parameters are the database (data), the partition number (P), the number of trees in the random forest (nTrees), and the voting strategy (vote).
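The comparison of the (rf, lr, knn) triples against the original classes under a voting strategy can be sketched in plain Python. The class labels and prediction triples below are made up for illustration, and the helper name is_noise is hypothetical; only the parameter name vote follows the text, assuming "majority" and "consensus" as the two strategies:

```python
# Sketch of the HTE-BD voting step: each instance carries an
# (rf, lr, knn) triple of predicted classes, which is compared with the
# actual class. Under "consensus" voting an instance is noise only if
# all three classifiers disagree with the actual class; under
# "majority" voting it is noise if most of them disagree.

def is_noise(actual, triple, vote="majority"):
    """Flag an instance as noise according to the voting strategy."""
    mismatches = sum(1 for pred in triple if pred != actual)
    if vote == "consensus":
        return mismatches == len(triple)    # all classifiers disagree
    return mismatches > len(triple) // 2    # a majority disagrees

instances = [
    ("x1", "A", ("A", "A", "B")),  # majority agrees   -> kept
    ("x2", "B", ("A", "A", "A")),  # all disagree      -> noise either way
    ("x3", "A", ("B", "B", "A")),  # majority disagrees -> noise (majority)
]

clean = [name for name, actual, triple in instances
         if not is_noise(actual, triple, vote="majority")]
print(clean)  # ['x1']
```

With `vote="consensus"` the filter is more conservative: in the example above only x2 would be removed, since x3 still has one classifier agreeing with its actual class.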

