Page 358 - From Smart Grid to Internet of Energy
            The kFold call yields a list of RDD pairs, one pair of training and test data
            for each of the k folds. The filter operation returns a new RDD that keeps only
            the elements satisfying a condition, here a condition on the predicted values,
            while map applies a transformation to each element of an RDD. The zipWithIndex
            operation zips an RDD with the indices of its elements. The ordering is based
            first on the partition index and then on the order of the items within each
            partition. Thus, the first item in the first partition gets index 0 and the last
            item in the last partition gets the highest index.
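
            The index ordering described above can be illustrated with a minimal sketch
            in plain Python. It does not use Spark: nested lists stand in for the
            partitions of an RDD, and the helper names zip_with_index, map_elements, and
            filter_elements are hypothetical stand-ins for the RDD operations zipWithIndex,
            map, and filter.

```python
# Plain-Python stand-in for RDD operations (assumption: no Spark involved).
# Each inner list plays the role of one RDD partition.

def zip_with_index(partitions):
    """Pair every element with an index, ordered by partition, then by position."""
    indexed, next_index = [], 0
    for part in partitions:
        pairs = []
        for item in part:
            pairs.append((item, next_index))
            next_index += 1
        indexed.append(pairs)
    return indexed


def map_elements(partitions, fn):
    """Apply fn to every element, keeping the partition layout (like map)."""
    return [[fn(item) for item in part] for part in partitions]


def filter_elements(partitions, keep):
    """Keep only the elements for which keep(item) is true (like filter)."""
    return [[item for item in part if keep(item)] for part in partitions]


partitions = [["a", "b"], ["c"], ["d", "e"]]
indexed = zip_with_index(partitions)
# Swap each pair to (index, value) so the index can serve as a key.
swapped = map_elements(indexed, lambda pair: (pair[1], pair[0]))
# Keep only the elements with an even index.
even_only = filter_elements(swapped, lambda pair: pair[0] % 2 == 0)
```

            In this sketch the element "a" in the first partition receives index 0 and
            the element "e" in the last partition receives the highest index, 4.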
               Random forests are a popular family of classification and regression
            methods in Spark APIs. They are ensembles of decision trees that combine many
            trees to reduce the risk of overfitting. The block diagram shown in Fig. 8.5
            was proposed by Garcia-Gil et al.: for each partition, a random forest model is
            learned and the learned model is then used to predict the test data. Once the
            test data and the predicted data are available, the zipWithIndex operation is
            applied to both RDDs so that their classes can be compared. Afterwards, the map
            function pairs the actual class of each instance with its predicted class. Any
            instance whose predicted and actual classes differ is regarded as noise and is
            removed by the filter function [19].
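
            The comparison step can be sketched in a few lines of plain Python. This is
            an illustration under stated assumptions, not the implementation in [19]: the
            function name remove_noise is hypothetical, the predictions are hard-coded
            rather than produced by a trained random forest, and dictionaries keyed by
            position stand in for RDDs that have been passed through zipWithIndex.

```python
def remove_noise(instances, predicted_classes):
    """Drop instances whose predicted class differs from the labelled class.

    instances: list of (features, label) pairs for one test fold.
    predicted_classes: one predicted class per instance, in the same order.
    """
    # zipWithIndex step: key both collections by element position.
    actual = {index: inst for index, inst in enumerate(instances)}
    predicted = {index: cls for index, cls in enumerate(predicted_classes)}
    if actual.keys() != predicted.keys():
        raise ValueError("one prediction is required per instance")
    # map step: pair each instance with its predicted class via the shared index.
    joined = [(actual[i], predicted[i]) for i in sorted(actual)]
    # filter step: keep only instances whose label matches the prediction.
    return [inst for inst, cls in joined if inst[1] == cls]


test_fold = [([5.1, 3.5], "A"), ([6.2, 2.9], "B"), ([4.9, 3.0], "A")]
predictions = ["A", "A", "A"]  # hypothetical model output, not real results
clean = remove_noise(test_fold, predictions)
```

            Here the second instance is labelled "B" but predicted as "A", so it is
            treated as noise and does not appear in clean.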
               Another noise filtering method proposed in [19] is the heterogeneous
            ensemble (HTE-BD) method, which is based on three classifiers: random forest,
            logistic regression, and kNN. In contrast, the homogeneous ensemble (HME-BD)
            is based on the random forest classification algorithm alone. The additional
            learning algorithms used in HTE-BD provide increased detection capability in
            noise filtering and data cleaning operations.
            HTE-BD mainly improves on the ensemble filter by using a larger set of
            classification algorithms. The decision tree is one of the most widely used
            machine learning methods in data mining. A decision tree starts with a single
            node, and each outcome of a node generates another node. Apache Spark provides
            optimizations for the scalability of decision trees, which grow by many nodes.
            A random forest combines many decision trees by means of an ensemble
            algorithm. Spark trains each tree of the random forest individually, so the
            training of each random forest can run in a distributed manner. The kNN, which
            is another component of the HTE-BD proposed by Garcia-Gil et al., is a
            supervised learning algorithm for classification. The algorithm presented
            below is derived from the ensemble filter, which relies on learning algorithms
            such as a decision tree, kNN, and a linear machine. The authors suggest using
            the Spark random forest model instead of a single decision tree, and logistic
            regression is used as the linear learner. As seen in Fig. 8.6, the three
            algorithms are trained and tested on the input data for every fold, from the
            first to the kth. The trained algorithms predict the test data and create RDD
            triples (rf, lr, knn), which are compared with the original classes. The
            required input parameters are the data set (data), the number of partitions
            (P), the number of trees in the random forest (nTrees), and the voting
            strategy (vote).
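
            How a voting strategy might act on the (rf, lr, knn) triples can be sketched
            as follows. This is not the authors' listing. The two strategy names,
            "consensus" and "majority", are an assumption borrowed from the general
            ensemble filter literature, since the text above names only the vote parameter;
            the function names is_noise and filter_fold and the sample values are
            hypothetical.

```python
def is_noise(actual, triple, vote):
    """Decide whether an instance is noise from three classifier predictions.

    actual: the labelled class of the instance.
    triple: predictions (rf, lr, knn) for that instance.
    vote: "consensus" (all classifiers must disagree with the label) or
          "majority" (more than half must disagree). Assumed strategy names.
    """
    misses = sum(1 for predicted in triple if predicted != actual)
    if vote == "consensus":
        return misses == len(triple)
    if vote == "majority":
        return misses > len(triple) / 2
    raise ValueError("unknown voting strategy: " + repr(vote))


def filter_fold(labels, triples, vote):
    """Return the positions of the instances that are kept (not noise)."""
    return [
        index
        for index, (actual, triple) in enumerate(zip(labels, triples))
        if not is_noise(actual, triple, vote)
    ]


labels = [0, 1, 1]
triples = [(0, 0, 0), (0, 0, 1), (0, 0, 0)]  # hypothetical predictions
kept_majority = filter_fold(labels, triples, "majority")
kept_consensus = filter_fold(labels, triples, "consensus")
```

            Under these assumptions, consensus voting is the more conservative choice:
            it removes the third instance only, whereas majority voting also removes the
            second instance, which two of the three classifiers misclassify.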





