but of course requires n - 1 more samples at each iteration than standard
SGD. Thus it is not entirely clear which algorithm is better. If the mini-batch
gradients could be computed in parallel with little overhead cost, the resulting
algorithm would achieve the described bound without the cost of sampling n
draws sequentially. Otherwise, the tradeoff is somewhat ambiguous.
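To make the tradeoff concrete, the following is a minimal Python sketch of mini-batch SGD; the gradient oracle grad_f, the data array, the batch size n, and the step-size schedule alpha are illustrative placeholders rather than notation from the text, and the n gradient evaluations inside each iteration are exactly the part that could run in parallel.

```python
import numpy as np

def minibatch_sgd(grad_f, data, w0, alpha, n, n_iters, seed=0):
    """Mini-batch SGD sketch: each step averages n sampled gradients.

    grad_f(w, sample) -> stochastic gradient for a single data point.
    alpha(k)          -> step size at iteration k.
    The n calls to grad_f within one iteration are independent, so they
    could be computed in parallel; done sequentially they cost n - 1
    extra draws per iteration compared with standard SGD.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for k in range(n_iters):
        batch = rng.choice(len(data), size=n, replace=True)  # sample a mini-batch
        g = np.mean([grad_f(w, data[i]) for i in batch], axis=0)  # averaged gradient
        w = w - alpha(k) * g  # standard SGD step with the averaged gradient
    return w
```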
2.3.4.2 SGD With Momentum
Another issue with standard SGD is that it does not take into account scaling
of the parameter space (which would require second-order information).
A particular situation where this is an issue is when the objective function
has very narrow level curves. Here, a standard gradient descent method
tends to zigzag back and forth, moving almost perpendicular to the direction
of the optimum. A common solution in the deterministic case is to use gradient
descent with momentum, which makes a simple yet helpful addition to the descent step.
The stochastic version is:
\[
w_{k+1} = w_k - \alpha_k \nabla F(w_k, \xi_k) + \beta\,(w_k - w_{k-1})
\]
where the last term is the momentum term.
As the name implies, this algorithm is motivated by the physical idea of
inertia, for example, of a ball rolling down a hill: it might make sense to
incorporate the previous step length this way into the next search direction.
In this sense, the algorithm “remembers” all previous step distances, in the
form of an exponentially decaying average. Momentum helps SGD avoid
getting stuck in narrow valleys and is popular for training neural networks.
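A minimal Python sketch of this update is given below; the gradient oracle grad_F, the sampler sample_xi, and the step-size schedule alpha are illustrative placeholders, not names fixed by the text.

```python
import numpy as np

def sgd_momentum(grad_F, sample_xi, w0, alpha, beta, n_iters):
    """SGD with momentum (heavy-ball) sketch:
        w_{k+1} = w_k - alpha_k * grad_F(w_k, xi_k) + beta * (w_k - w_{k-1})
    starting from w_0 = w_{-1} = w0.
    """
    w_prev = np.asarray(w0, dtype=float)
    w = w_prev.copy()
    for k in range(n_iters):
        xi = sample_xi()                                        # draw xi_k
        w_next = w - alpha(k) * grad_F(w, xi) + beta * (w - w_prev)
        w_prev, w = w, w_next                                   # shift the two-step window
    return w
```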
Unfortunately, the convergence of SGD with momentum is not as well understood.
An upper bound for a convex, nonsmooth objective is given by Yang, Lin, and
Li (2016) in the form of:
\[
\mathbb{E}\big[f(\bar{x}_k) - f(x^*)\big] \le
\frac{\beta E_0^x}{(1-\beta)(k+1)}
+ \frac{(1-\beta)\,(E_0^x)^2}{2C\sqrt{k+1}}
+ \frac{C(L^2 + \delta^2)}{2(1-\beta)\sqrt{k+1}}
\]
Note here that the error is given in terms of the average of the iterates,
\(\bar{x}_k = \frac{1}{k}\sum_{i=1}^{k} x_i\). For problems in this class the optimal convergence rate
is \(O(1/\sqrt{k})\), which this algorithm achieves.
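Because the bound is stated for the averaged iterate rather than the last one, implementations typically maintain the average as a running mean; the following short Python sketch (the function name and iterate sequence are illustrative) shows the incremental update.

```python
def averaged_iterate(iterates):
    """Return bar_x_k = (1/k) * sum_{i=1}^{k} x_i, the running average of an
    iterate sequence, updated incrementally so no history is stored."""
    x_bar, k = None, 0
    for x in iterates:
        k += 1
        # incremental mean: bar_x_k = bar_x_{k-1} + (x_k - bar_x_{k-1}) / k
        x_bar = x if k == 1 else x_bar + (x - x_bar) / k
    return x_bar
```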
2.3.5 Nesterov’s Accelerated Gradient Descent
Nesterov’s accelerated gradient descent (NAGD) algorithm for determinis-
tic settings has been shown to be optimal for a variety of problem assump-
tions. For example, in the case where the objective is smooth and strongly
convex, NAGD achieves the lower complexity bound, unlike standard