


           where the expectation is taken with respect to the probability distribution of
           ξ = (X, Y), the explanatory and response variables. In ML, F(w; ξ) is
           typically of the form:

                              $$F(w;\xi) = \ell\bigl(h(X,w) - Y\bigr)$$
           where ℓ is some loss function, such as the squared norm ‖·‖². Most optimization
           algorithms attempt to optimize what is called the sample-path problem, the solution
           of which is often referred to as the empirical risk minimizer (ERM). This
           strategy attempts to solve an approximation to f(w) using a finite amount
           of training data:
                              $$\min_{w}\ \hat{f}(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(h(x_i,w) - y_i\bigr)$$
           This is the objective that the bulk of ML algorithms attempt to minimize.
           A particular instance is maximum likelihood estimation, perhaps the most
           widely used estimation method among statisticians. For computer scientists,
           minimizing a loss function (such as ℓ₂ loss) without an explicit statistical
           model is more common.
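
           As a concrete illustration, the short Python sketch below performs empirical risk
           minimization for a linear hypothesis h(x, w) = xᵀw under squared loss, using plain
           full-batch gradient descent on f̂(w). The synthetic data, step size, and iteration
           count are illustrative assumptions, not taken from the text.

               import numpy as np

               # Synthetic training data (illustrative only): n samples, d features.
               rng = np.random.default_rng(0)
               n, d = 200, 5
               X = rng.normal(size=(n, d))                 # explanatory variables x_i
               w_true = rng.normal(size=d)
               y = X @ w_true + 0.1 * rng.normal(size=n)   # response variables y_i

               def empirical_risk(w):
                   # f_hat(w) = (1/n) * sum_i (h(x_i, w) - y_i)^2 with h(x, w) = x @ w.
                   return np.mean((X @ w - y) ** 2)

               def empirical_risk_grad(w):
                   # Gradient of the empirical risk above with respect to w.
                   return 2.0 / n * X.T @ (X @ w - y)

               # Full-batch gradient descent on the sample-path (ERM) problem.
               w = np.zeros(d)
               for _ in range(500):
                   w -= 0.1 * empirical_risk_grad(w)

               print(empirical_risk(w))   # small residual risk left by the 0.1 noise level

           With squared loss and a linear h, the same minimizer could of course be obtained in
           closed form (ordinary least squares); the iterative form is shown because it is the
           pattern the stochastic methods below build on.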
              It may go without saying that minimizing f̂(w) is not the end
           goal; ideally we wish to minimize the true objective, f(w). For finite sample
          sizes one runs into issues of overfitting to the sample problem. An adequate
          solution to the sample problem may not generalize well to future predictions
          (especially true for highly complex, over-parameterized models). Some
          ways of combating this problem include early-stopping, regularization,
          and Bayesian methods. In a large data setting, however, we can work around
          this issue. Due to such large sample sizes, we have an essentially infinite
          amount of data. Additionally, we are often in an online setting where data
          are continuously incoming, and we would like our model to continuously
          learn from new data. In this case we can model data as coming from an oracle
           that draws from the true probability distribution P_ξ (such a scheme is fairly
           standard in the stochastic optimization community, but rarer in
          ML). In this manner we can optimize the true risk (1) with respect to the
          objective function, f(w).
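
           The gap between the sample problem and the true risk, and the role of the oracle
           P_ξ, can be made concrete with a small sketch. The generative model inside oracle
           below (a noisy linear signal with 20 features) is purely an assumption for
           illustration; any distribution P_ξ could take its place.

               import numpy as np

               rng = np.random.default_rng(1)

               def oracle(n):
                   # Fresh draws of xi = (X, Y) from an assumed P_xi: 20 features,
                   # only the first carries signal, plus Gaussian noise.
                   x = rng.normal(size=(n, 20))
                   y = x[:, 0] + 0.5 * rng.normal(size=n)
                   return x, y

               # Over-parameterized setting: 20 weights fit to only 15 training points.
               X_train, y_train = oracle(15)
               w_erm, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)  # exact ERM for squared loss

               # A very large fresh sample from the oracle approximates the true risk f(w).
               X_fresh, y_fresh = oracle(100_000)
               print(np.mean((X_train @ w_erm - y_train) ** 2))   # empirical risk: essentially zero
               print(np.mean((X_fresh @ w_erm - y_fresh) ** 2))   # estimated true risk: noticeably larger

           In the essentially infinite-data or online regime described above, every update can
           instead use a brand-new draw from the oracle, which is exactly what the SGD
           algorithm of the next subsection exploits.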


          2.3.2 Stochastic Gradient Descent Algorithm

           Algorithm 2.1 shows a generic SGD algorithm to be used to solve (1). The
           direction of each iteration, g(x_k, ξ_k), depends on the current iterate x_k as well
           as a random draw ξ_k ∼ P_ξ.
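
           Algorithm 2.1 itself is not reproduced here, but the following Python sketch gives
           the shape of such a generic SGD loop: an iterate x_k, a stochastic direction
           g(x_k, ξ_k), and a step-size sequence α_k. The helper names (sgd, sample_xi), the
           1/(k + 10) step-size schedule, and the linear-model example are illustrative
           assumptions only.

               import numpy as np

               def sgd(g, sample_xi, x0, step_size, n_iters):
                   # Generic loop: x_{k+1} = x_k - alpha_k * g(x_k, xi_k), with xi_k ~ P_xi.
                   x = np.asarray(x0, dtype=float)
                   for k in range(n_iters):
                       xi = sample_xi()                  # random draw xi_k from P_xi
                       x = x - step_size(k) * g(x, xi)   # move along the stochastic direction
                   return x

               # Example use: minimize the true risk E[(X @ w - Y)^2] of a noisy linear model.
               rng = np.random.default_rng(2)
               w_true = np.array([1.0, -2.0, 0.5])

               def sample_xi():
                   x = rng.normal(size=3)
                   return x, x @ w_true + 0.1 * rng.normal()

               def g(w, xi):
                   x, y = xi
                   return 2.0 * (x @ w - y) * x          # stochastic gradient of the squared loss

               w_hat = sgd(g, sample_xi, np.zeros(3),
                           step_size=lambda k: 1.0 / (k + 10), n_iters=20_000)
               print(w_hat)                              # close to w_true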