A plethora of tools have been proposed specifically to minimize the empirical risk (sample size N is finite) in the case of GLMs, achieving even faster linear or superlinear convergence rates. These methods either reduce the variance of the SA (data-subsampling) error of the stochastic gradient (Defazio, Bach, & Lacoste-Julien, 2014; Johnson & Zhang, 2013; Schmidt, Roux, & Bach, 2013) or use approximate second-order (Hessian) information (Goldfarb, 1970; Shanno & Phua, 1976). This thread has culminated in quasi-Newton methods (Mokhtari, Gürbüzbalaban, & Ribeiro, 2016) outperforming variance reduction methods (Hu, Pan, & Kwok, 2009) for large-scale finite-sum minimization. For specifics on stochastic quasi-Newton updates, see Mokhtari and Ribeiro (2015) and Byrd, Hansen, Nocedal, and Singer (2016). However, as N → ∞, the analysis that yields linear or superlinear learning rates breaks down, and the best one can hope for is Nesterov's O(1/t²) rate (Nemirovski et al., 2009).
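
To make the variance-reduction idea concrete, the following is a minimal sketch in the style of SVRG (Johnson & Zhang, 2013) applied to a finite-sum logistic-regression (GLM) objective. The step size, epoch length, and synthetic data are illustrative assumptions rather than values prescribed by the text; stochastic quasi-Newton variants would additionally maintain a curvature estimate.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss over the rows of X (labels y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def svrg(X, y, step=0.1, epochs=20, inner_steps=None, seed=0):
    """Variance-reduced stochastic gradient (SVRG-style) for finite-sum logistic regression.

    Each outer epoch computes the full gradient once at a snapshot point; inner
    stochastic steps then use g_i(w) - g_i(w_snap) + full_grad as the descent
    direction, whose variance shrinks as w approaches a minimizer.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    inner_steps = inner_steps or n
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = logistic_grad(w_snap, X, y)        # one full pass per epoch
        for _ in range(inner_steps):
            i = rng.integers(n)
            xi, yi = X[i:i + 1], y[i:i + 1]
            g_i = logistic_grad(w, xi, yi)
            g_i_snap = logistic_grad(w_snap, xi, yi)
            w -= step * (g_i - g_i_snap + full_grad)   # variance-reduced step
    return w

# Illustrative run on synthetic data (all parameters are assumptions).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 10))
    w_true = rng.standard_normal(10)
    y = (X @ w_true + 0.1 * rng.standard_normal(500) > 0).astype(float)
    w_hat = svrg(X, y)
    print("training accuracy:", np.mean((X @ w_hat > 0) == y))
```

The key point is that the correction term g_i(w) − g_i(w_snap) + full_grad has vanishing variance near a minimizer, which is what makes the fast (linear) rates discussed above attainable when N is finite.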

          2.3.7 Learning Feature Representations for Inference

Transformations of data domains have become widely used in the past decades, due to their ability to extract useful information from input signals as a precursor to solving statistical inference problems. For instance, if the signal dimension is very large, dimensionality reduction is of interest, which may be approached with principal component analysis (Jolliffe, 1986). If instead one would like to conduct multiresolution analysis, wavelets (Mallat, 2008) may be more appropriate. These techniques, which also include k-nearest neighbors, are known as unsupervised or signal representation learning (Murphy, 2012). Recently, methods based on learned representations, rather than those fixed a priori, have gained traction in pattern recognition (Elad & Aharon, 2006; Mairal, Elad, & Sapiro, 2008). A special case of data-driven representation learning is dictionary learning (Mairal, Bach, Ponce, Sapiro, & Zisserman, 2008), the focus of this section.
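
As a point of contrast with the learned (task-driven) representations developed below, a fixed unsupervised representation such as PCA can be computed once from the data's second-order statistics. The following is a minimal sketch assuming nothing from the text beyond the idea of projecting high-dimensional signals onto a lower-dimensional subspace; the dimensions and synthetic data are illustrative.

```python
import numpy as np

def pca_encode(X, k):
    """Project signals (rows of X) onto the top-k principal directions.

    This is a fixed, unsupervised encoding: it depends only on the data's
    covariance structure, not on any downstream inference task.
    """
    Xc = X - X.mean(axis=0)                        # center the signals
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # top-k principal directions
    return Xc @ components.T, components           # k-dimensional codes

# Illustrative use: reduce 100-dimensional signals to 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))
codes, _ = pca_encode(X, k=5)
print(codes.shape)  # (200, 5)
```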
Here we address finding a dictionary (signal encoding) that is well adapted to a specific inference task (Mairal, Bach, & Ponce, 2012). To do so, denote the coding α(D; x) ∈ ℝ^k as a feature representation of the signal x_t with respect to some dictionary matrix D ∈ ℝ^(p×k). Typically, α(D; x) is chosen as the solution to a lasso regression, or an approximate solution to an ℓ0-constrained problem, that minimizes some criterion of distance between Dα and x so as to incentivize codes to be sparse. Further introduce the classifier w ∈ ℝ^k that is used to predict the target variable y_t when given the signal encoding α(D; x). The merit of the classifier w ∈ W ⊆ ℝ^k is measured by the smooth loss function