A plethora of tools have been proposed specifically to minimize the empirical risk (sample size N is finite) in the case of GLMs, achieving even faster linear or superlinear convergence rates. These methods either reduce the variance of the SA (data-subsampling) error of the stochastic gradient (Defazio, Bach, & Lacoste-Julien, 2014; Johnson & Zhang, 2013; Schmidt, Roux, & Bach, 2013) or use approximate second-order (Hessian) information (Goldfarb, 1970; Shanno & Phua, 1976). This thread has culminated in quasi-Newton methods (Mokhtari, Gürbüzbalaban, & Ribeiro, 2016) outperforming variance reduction methods (Hu, Pan, & Kwok, 2009) for large-scale finite-sum minimization. For specifics on stochastic quasi-Newton updates, see Mokhtari and Ribeiro (2015) and Byrd, Hansen, Nocedal, and Singer (2016). However, as N → ∞, the analysis that yields linear or superlinear learning rates breaks down, and the best one can hope for is Nesterov's O(1/t²) rate (Nemirovski et al., 2009).
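
To make the variance-reduction idea concrete, the following is a minimal sketch in the style of SVRG (Johnson & Zhang, 2013) applied to a finite-sum logistic-regression (GLM) objective. The step size, epoch length, and synthetic data are illustrative assumptions rather than values prescribed by the text; stochastic quasi-Newton variants would additionally maintain a curvature estimate.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss over the rows of X (labels y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def svrg(X, y, step=0.1, epochs=20, inner_steps=None, seed=0):
    """Variance-reduced stochastic gradient (SVRG-style) for finite-sum logistic regression.

    Each outer epoch computes the full gradient once at a snapshot point; inner
    stochastic steps then use g_i(w) - g_i(w_snap) + full_grad as the descent
    direction, whose variance shrinks as w approaches a minimizer.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    inner_steps = inner_steps or n
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = logistic_grad(w_snap, X, y)        # one full pass per epoch
        for _ in range(inner_steps):
            i = rng.integers(n)
            xi, yi = X[i:i + 1], y[i:i + 1]
            g_i = logistic_grad(w, xi, yi)
            g_i_snap = logistic_grad(w_snap, xi, yi)
            w -= step * (g_i - g_i_snap + full_grad)   # variance-reduced step
    return w

# Illustrative run on synthetic data (all parameters are assumptions).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 10))
    w_true = rng.standard_normal(10)
    y = (X @ w_true + 0.1 * rng.standard_normal(500) > 0).astype(float)
    w_hat = svrg(X, y)
    print("training accuracy:", np.mean((X @ w_hat > 0) == y))
```

The key point is that the correction term g_i(w) − g_i(w_snap) + full_grad has vanishing variance near a minimizer, which is what makes the fast (linear) rates discussed above attainable when N is finite.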

          2.3.7 Learning Feature Representations for Inference

Transformations of data domains have become widely used in the past decades, due to their ability to extract useful information from input signals as a precursor to solving statistical inference problems. For instance, if the signal dimension is very large, dimensionality reduction is of interest, which may be approached with principal component analysis (Jolliffe, 1986). If instead one would like to conduct multiresolution analysis, wavelets (Mallat, 2008) may be more appropriate. These techniques, which also include k-nearest neighbors, are known as unsupervised or signal representation learning (Murphy, 2012). Recently, methods based on learned representations, rather than those fixed a priori, have gained traction in pattern recognition (Elad & Aharon, 2006; Mairal, Elad, & Sapiro, 2008). A special case of data-driven representation learning is dictionary learning (Mairal, Bach, Ponce, Sapiro, & Zisserman, 2008), the focus of this section.
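
As a point of contrast with the learned (task-driven) representations developed below, a fixed unsupervised representation such as PCA can be computed once from the data's second-order statistics. The following is a minimal sketch assuming nothing from the text beyond the idea of projecting high-dimensional signals onto a lower-dimensional subspace; the dimensions and synthetic data are illustrative.

```python
import numpy as np

def pca_encode(X, k):
    """Project signals (rows of X) onto the top-k principal directions.

    This is a fixed, unsupervised encoding: it depends only on the data's
    covariance structure, not on any downstream inference task.
    """
    Xc = X - X.mean(axis=0)                        # center the signals
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # top-k principal directions
    return Xc @ components.T, components           # k-dimensional codes

# Illustrative use: reduce 100-dimensional signals to 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))
codes, _ = pca_encode(X, k=5)
print(codes.shape)  # (200, 5)
```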
Here we address finding a dictionary (signal encoding) that is well adapted to a specific inference task (Mairal, Bach, & Ponce, 2012). To do so, denote the coding α(D; x) ∈ ℝ^k as a feature representation of the signal x_t with respect to some dictionary matrix D ∈ ℝ^(p×k). Typically, α(D; x) is chosen as the solution to a lasso regression, or an approximate solution to an ℓ0-constrained problem, that minimizes some criterion of distance between Dα and x so as to incentivize codes to be sparse. Further introduce the classifier w ∈ ℝ^k that is used to predict the target variable y_t when given the signal encoding α(D; x). The merit of the classifier w ∈ W ⊆ ℝ^k is measured by the smooth loss function