          but of course requires n - 1 more samples at each iteration than the standard
          SGD. Thus it is not entirely clear which algorithm is better. If one could
          perform mini-batch in parallel with little overhead cost, the resulting algo-
          rithm would achieve the described bound without the cost of sampling n
          draws sequentially. Otherwise, the tradeoff is somewhat ambiguous.
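
          As an illustration, the gradient over a mini-batch can be computed in a single
          vectorized (and hence easily parallelized) operation rather than by n sequential
          draws. The following is a minimal NumPy sketch; the least-squares loss, the batch
          size of 32, and the function name minibatch_sgd_step are illustrative assumptions,
          not part of the text.

              import numpy as np

              def minibatch_sgd_step(w, X_batch, y_batch, alpha):
                  # One mini-batch SGD step for the least-squares loss
                  # F(w, xi) = 0.5 * (x^T w - y)^2 (illustrative choice).
                  # The n per-sample gradients are obtained in one vectorized
                  # matrix product rather than n sequential evaluations.
                  residuals = X_batch @ w - y_batch            # shape (n,)
                  grad = X_batch.T @ residuals / len(y_batch)  # mean of the n sample gradients
                  return w - alpha * grad

              # usage: sample a batch of n = 32 points and take one step
              rng = np.random.default_rng(0)
              X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
              w = np.zeros(5)
              idx = rng.choice(len(y), size=32, replace=False)
              w = minibatch_sgd_step(w, X[idx], y[idx], alpha=0.1)
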

          2.3.4.2 SGD With Momentum
          Another issue with standard SGD is that it does not take into account scaling
          of the parameter space (which would require second-order information).
          A particular situation where this is an issue is when the objective function
          has very narrow level curves. Here, a standard gradient descent method
          tends to zigzag back and forth, moving almost perpendicular to the direction
          toward the optimum. A common solution in the deterministic case is to use
          gradient descent
          with momentum, which has a simple yet helpful addition to the descent step.
          The stochastic version is:

              w_{k+1} = w_k - \alpha_k \nabla F(w_k, \xi_k) + \beta (w_k - w_{k-1})
          where the last term is the momentum term.
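
          A minimal NumPy sketch of this update follows; the noisy quadratic objective
          and the constants alpha = 0.1 and beta = 0.9 are illustrative assumptions,
          not values from the text.

              import numpy as np

              def sgd_momentum_step(w, w_prev, grad, alpha, beta):
                  # w_{k+1} = w_k - alpha_k * grad F(w_k, xi_k) + beta * (w_k - w_{k-1})
                  return w - alpha * grad + beta * (w - w_prev), w

              # toy usage on the noisy quadratic f(w) = 0.5 * ||w||^2
              rng = np.random.default_rng(0)
              w_prev = w = np.array([5.0, -3.0])
              for k in range(100):
                  grad = w + 0.1 * rng.normal(size=w.shape)   # stochastic gradient estimate
                  w, w_prev = sgd_momentum_step(w, w_prev, grad, alpha=0.1, beta=0.9)
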
             As the name implies, this algorithm is motivated by the physical idea of
          inertia, for example that of a ball rolling down a hill, which makes it natural
          to incorporate the previous step length into the next search direction.
          In this sense, the algorithm “remembers” all previous step distances, in the
          form of an exponentially decaying average. Momentum helps SGD avoid
          getting stuck in narrow valleys and is popular for training neural networks.
          Unfortunately, the convergence of SGD with momentum is not as well understood
          as that of standard SGD. An upper bound for a convex, nonsmooth objective is
          given by Yang, Lin, and Li (2016) in the form of:
              E[f(\bar{x}_k) - f(x^*)] \le \frac{\beta E_0}{(1-\beta)(k+1)}
                  + \frac{(1-\beta)(E_0^x)^2}{2C\sqrt{k+1}}
                  + \frac{C(L^2 + \delta^2)}{2(1-\beta)\sqrt{k+1}}
          Note here that the error is given in terms of the average of the iterates,
          \bar{x}_k = (1/k) \sum_{i=1}^{k} x_i. For problems in this class the optimal
          convergence rate is O(1/\sqrt{k}), which this algorithm achieves.
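
          In practice the averaged iterate can be maintained at essentially no extra cost
          with a running-average update. The short NumPy sketch below shows one way to do
          so, reusing the toy noisy quadratic from the earlier sketch (an illustrative
          assumption, not an example from the text).

              import numpy as np

              rng = np.random.default_rng(1)
              w_prev = w = np.array([5.0, -3.0])
              x_bar = np.zeros_like(w)                 # averaged iterate \bar{x}_k
              for k in range(1, 101):
                  grad = w + 0.1 * rng.normal(size=w.shape)            # stochastic gradient
                  w, w_prev = w - 0.1 * grad + 0.9 * (w - w_prev), w   # momentum step
                  # \bar{x}_k = \bar{x}_{k-1} + (x_k - \bar{x}_{k-1}) / k
                  x_bar += (w - x_bar) / k
              # the bound above controls the error at x_bar, not at the last iterate w
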
          2.3.5 Nesterov’s Accelerated Gradient Descent
          Nesterov’s accelerated gradient descent (NAGD) algorithm for determinis-
          tic settings has been shown to be optimal for a variety of problem assump-
          tions. For example, in the case where the objective is smooth and strongly
          convex, NAGD achieves the lower complexity bound, unlike standard