where the expectation is taken with respect to the probability distribution of $\xi = (X, Y)$, the explanatory and response variables. In ML, $F(w; \xi)$ is typically of the form:

\[ F(w; \xi) = \ell(h(X, w) - Y) \]

where $\ell$ is some loss function, such as the squared norm $\|\cdot\|^2$. Most optimization algorithms
attempt to optimize what is called the sample-path problem, the solution
of which is often referred to as the empirical risk minimizer (ERM). This
strategy attempts to solve an approximation to f(w) using a finite amount
of training data:
\[ \min_{w} \; \hat{f}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i, w) - y_i\big) \]
This is the problem that the bulk of ML algorithms attempt to solve.
A particular instance is maximum likelihood estimation, perhaps the most
widely used estimation method among statisticians. For computer scientists,
minimizing a loss function (such as the $\ell_2$ loss) without an explicit statistical
model is more common.
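To make the sample-path objective concrete, the sketch below evaluates $\hat{f}(w)$ for a linear model $h(x, w) = x^\top w$ with squared loss. The toy data, dimensions, and function names are illustrative assumptions, not part of the chapter.

```python
import numpy as np

def empirical_risk(w, X, y):
    """Sample-path (ERM) objective: the average squared loss of a
    linear model h(x, w) = x @ w over the training set."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Hypothetical toy data: n = 1000 draws (x_i, y_i) from a linear model plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

print(empirical_risk(np.zeros(5), X, y))  # f_hat at w = 0
print(empirical_risk(w_true, X, y))       # f_hat near the data-generating parameters
```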
It perhaps goes without saying that minimizing $\hat{f}(w)$ is not the end
goal. Ideally, we wish to minimize the true objective, f(w). For finite sample
sizes, one runs into issues of overfitting to the sample problem. An adequate
solution to the sample problem may not generalize well to future predictions
(this is especially true for highly complex, over-parameterized models). Some
ways of combating this problem include early stopping, regularization,
and Bayesian methods. In a large data setting, however, we can work around
this issue. Due to such large sample sizes, we have an essentially infinite
amount of data. Additionally, we are often in an online setting where data
are continuously incoming, and we would like our model to continuously
learn from new data. In this case we can model data as coming from an oracle
that draws from the true probability distribution $P_\xi$ (such a scheme is fairly
standard in the stochastic optimization community, but rarer in ML).
In this manner we can optimize the true risk (1), with f(w) as the
objective function.
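As a rough illustration of this oracle view (not taken from the chapter), the sketch below treats $P_\xi$ as a simple Gaussian linear model and estimates the true risk f(w) from fresh draws rather than from a fixed training set. The distribution, parameters, and helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])  # assumed parameters of the "true" distribution P_xi

def oracle(batch_size=32):
    """Each call draws fresh (X, y) pairs from the true distribution P_xi,
    mimicking continuously incoming data."""
    X = rng.normal(size=(batch_size, w_true.size))
    y = X @ w_true + 0.1 * rng.normal(size=batch_size)
    return X, y

def true_risk_estimate(w, n_draws=10_000):
    """Monte Carlo estimate of the true risk f(w) using fresh oracle draws,
    so the estimate is not biased toward any fixed training sample."""
    X, y = oracle(n_draws)
    return np.mean((X @ w - y) ** 2)

print(true_risk_estimate(np.zeros(3)))  # high risk far from w_true
print(true_risk_estimate(w_true))       # roughly the irreducible noise level
```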
2.3.2 Stochastic Gradient Descent Algorithm
Algorithm 2.1 shows a generic SGD algorithm that can be used to solve (1). The
direction of each iteration, $g(x_k, \xi_k)$, depends on the current iterate $x_k$ as well
as a random draw $\xi_k \sim P_\xi$.
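Algorithm 2.1 itself is not reproduced in this excerpt, so the sketch below is only a minimal SGD loop consistent with the description above, written with w as the iterate (the chapter's $x_k$). The linear model, squared loss, and step-size schedule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0, 0.5])  # assumed parameters of the data-generating distribution

def draw_xi():
    """A random draw xi_k ~ P_xi: one (x, y) pair from the (assumed) true distribution."""
    x = rng.normal(size=w_true.size)
    y = x @ w_true + 0.1 * rng.normal()
    return x, y

def g(w, xi):
    """Stochastic direction g(w_k, xi_k): gradient of the squared loss
    l(h(x, w) - y) with h(x, w) = x . w, evaluated at a single draw."""
    x, y = xi
    return 2.0 * (x @ w - y) * x

# Generic SGD iteration: w_{k+1} = w_k - alpha_k * g(w_k, xi_k)
w = np.zeros(3)
for k in range(1, 10_001):
    alpha_k = 1.0 / (10.0 + k)  # a diminishing step-size schedule (one common choice)
    w = w - alpha_k * g(w, draw_xi())

print(w)  # should be close to w_true after enough iterations
```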