Page 42 - Artificial Intelligence for the Internet of Everything
Uncertainty Quantification in Internet of Battlefield Things 29
gradient descent (Nesterov, 2004). To date, NAGD has received little attention in the stochastic setting (though see Yang et al., 2016). We would like to extend NAGD to several problem settings in a stochastic context. Additionally, we would like to incorporate adaptive sampling (varying the batch size across iterations) into a stochastic NAGD. Adaptive sampling has shown promise, yet remains relatively unexplored.
The development of a more-efficient SGD algorithm would reduce
uncertainty in the training of ML models at a faster rate than current algo-
rithms. This is especially important for online learning, where it is crucial
that algorithms adapt efficiently to newly observed data in the field.
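One way to picture such a scheme is to combine Nesterov-style momentum with a batch size that grows over iterations, so that gradient noise shrinks as the iterate approaches a solution. The sketch below is purely illustrative: the least-squares objective, the geometric batch-growth schedule, and all constants are our assumptions, not a published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize E[(w^T x - y)^2] over w.
d = 5
w_true = rng.normal(size=d)

def sample_batch(n):
    """Draw n fresh (x, y) pairs from the data-generating distribution."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def stochastic_nagd_adaptive(iters=50, batch0=4, growth=1.2, eta=0.05, gamma=0.9):
    """Nesterov-style momentum with a geometrically growing batch size
    (adaptive sampling): larger batches reduce gradient noise later on."""
    w = np.zeros(d)
    v = np.zeros(d)            # "look-ahead" point where gradients are taken
    batch = float(batch0)
    for _ in range(iters):
        X, y = sample_batch(int(batch))
        grad = 2 * X.T @ (X @ v - y) / len(y)   # minibatch gradient at v
        w_next = v - eta * grad
        v = w_next + gamma * (w_next - w)       # momentum extrapolation
        w = w_next
        batch *= growth                         # adaptive sampling: grow batch
    return w

w_hat = stochastic_nagd_adaptive()
print(np.linalg.norm(w_hat - w_true))  # distance to the true parameter
```

On this toy problem the growing batch plays the role of a decaying noise level: early iterations are cheap and noisy, later ones are accurate, which is the intuition behind pairing adaptive sampling with acceleration.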
2.3.6 Generalized Linear Models
The first and simplest choice of estimator function class is $\mathcal{F} = \mathbb{R}^p$. In this case, the estimator is a generalized linear model (GLM): $\hat{y}(x) = w^\top x$ for some parameter vector $w \in \mathbb{R}^p$ (Nelder & Baker, 1972). In this case optimizing the statistical loss is the stochastic convex optimization problem, stated as:

$$\min_{w \in \mathbb{R}^p} \; \mathbb{E}_{x,y}\!\left[\ell(w^\top x, y)\right] \qquad (2.3)$$
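Concretely, the expectation in Eq. (2.3) can only be approximated by an empirical average over samples. A minimal illustration with a linear estimator and squared loss (the synthetic data and the choice of squared loss are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3
w_true = rng.normal(size=d)

def predict(w, X):
    """GLM estimator: yhat(x) = w^T x, applied row-wise to X."""
    return X @ w

def empirical_loss(w, X, y):
    """Squared loss l(yhat, y) = (yhat - y)^2; the statistical loss E[l]
    is approximated by the empirical mean over a finite sample."""
    return np.mean((predict(w, X) - y) ** 2)

# Synthetic data: y = w_true^T x plus small Gaussian noise.
X = rng.normal(size=(1000, d))
y = X @ w_true + 0.1 * rng.normal(size=1000)

print(empirical_loss(w_true, X, y))       # near the noise variance, ~0.01
print(empirical_loss(np.zeros(d), X, y))  # much larger at a poor parameter
```

The minimizer of the empirical average approaches the minimizer of Eq. (2.3) as the sample size grows, which is exactly why direct optimization of the expectation is replaced by sampling-based methods below.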
Observe that to optimize Eq. (2.3), assuming a closed-form solution is unavailable, gradient descent or Newton's method must be used (Boyd & Vandenberghe, 2004). However, either method requires computing the gradient of $L(w) := \mathbb{E}_{x,y}[\ell(w^\top x, y)]$, which requires infinitely many realizations $(x_n, y_n)$ of the random pair $(x, y)$, and thus has infinite complexity. This computational bottleneck has been resolved through the development of stochastic approximation (SA) methods (Bottou, 1998a; Robbins & Monro, 1951), which operate on subsets of data examples per step. The most common SA method is SGD, which involves descending along the stochastic gradient $\nabla_w \ell(w_t^\top x_t, y_t)$ rather than the true gradient at each step:

$$w_{t+1} = w_t - \eta_t \nabla_w \ell(w_t^\top x_t, y_t) \qquad (2.4)$$
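The update in Eq. (2.4) amounts to a few lines of code. The sketch below applies it to a least-squares GLM with one fresh sample per step; the step-size schedule $\eta_t = \eta_0/\sqrt{t}$ and all constants are illustrative assumptions on our part:

```python
import numpy as np

rng = np.random.default_rng(2)

d = 4
w_true = rng.normal(size=d)

def sgd(steps=5000, eta0=0.05):
    """Plain SGD (Eq. 2.4) on l(w^T x, y) = (w^T x - y)^2,
    using one fresh sample (x_t, y_t) per iteration."""
    w = np.zeros(d)
    for t in range(1, steps + 1):
        x = rng.normal(size=d)
        y = x @ w_true + 0.1 * rng.normal()
        grad = 2 * (w @ x - y) * x          # stochastic gradient at w_t
        w = w - (eta0 / np.sqrt(t)) * grad  # Eq. (2.4) with eta_t = eta0/sqrt(t)
    return w

w_hat = sgd()
print(np.linalg.norm(w_hat - w_true))  # shrinks toward 0 as steps grow
```

Each iteration touches a single data point, so the per-step cost is independent of the dataset size; that is the property that makes SA methods attractive for the streaming settings discussed above.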
Use of SGD (Eq. 2.4) is prevalent due to its simplicity, ease of use, and the fact that it converges to the minimizer of Eq. (2.3) almost surely, and in expectation at an $O(1/\sqrt{t})$ rate when $L(w)$ is convex and at an $O(1/t)$ rate when it is strongly convex. Efforts to improve the mean convergence rate to $O(1/t^2)$ through the use of Nesterov acceleration (Nemirovski, Juditsky, Lan, & Shapiro, 2009) have also been developed, whose updates are given as:

$$w_{t+1} = v_t - \eta_t \nabla_w \ell(v_t^\top x_t, y_t) \qquad (2.5)$$
$$v_{t+1} = (1 - \gamma_t) w_{t+1} + \gamma_t w_t$$
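The accelerated scheme of Eq. (2.5) maintains an auxiliary sequence $v_t$ alongside $w_t$: the stochastic gradient is evaluated at $v_t$, and the two iterates are then combined. A minimal sketch on the same kind of least-squares GLM (the constant mixing weight $\gamma$, the step-size schedule, and the toy problem are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

d = 4
w_true = rng.normal(size=d)

def accelerated_sgd(steps=5000, eta0=0.05, gamma=0.5):
    """Eq. (2.5): take the stochastic gradient step at the auxiliary point
    v_t, then set v_{t+1} = (1 - gamma) * w_{t+1} + gamma * w_t."""
    w = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, steps + 1):
        x = rng.normal(size=d)
        y = x @ w_true + 0.1 * rng.normal()
        grad = 2 * (v @ x - y) * x              # gradient of squared loss at v_t
        w_next = v - (eta0 / np.sqrt(t)) * grad
        v = (1 - gamma) * w_next + gamma * w    # combine the two iterates
        w = w_next
    return w

print(np.linalg.norm(accelerated_sgd() - w_true))
```

The only structural change from plain SGD is the second sequence: gradients are no longer taken at the current iterate $w_t$ but at the combined point $v_t$, which is what produces the accelerated behavior in the deterministic setting.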