49 Pros and cons of end-to-end learning

Consider the same speech pipeline from our earlier example:

Many parts of this pipeline were “hand-engineered”:

• MFCCs are a set of hand-designed audio features. Although they provide a reasonable
summary of the audio input, they also simplify the input signal by throwing some
information away.

• Phonemes are an invention of linguists. They are an imperfect representation of speech
sounds. To the extent that phonemes are a poor approximation of reality, forcing an
algorithm to use a phoneme representation will limit the speech system’s performance.
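As a rough illustration of how a hand-designed feature can throw information away, consider the magnitude spectrum, which is an early stage of a typical MFCC computation. The sketch below is not an MFCC implementation: it omits the mel filterbank, the logarithm and the final cosine transform. The helper `magnitude_spectrum` and the toy `frame` are hypothetical names chosen for this example. It shows that a short signal and its time-reversed copy, which are clearly different inputs, produce identical magnitude features because the phase has been discarded.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Return |DFT(signal)| via a direct O(n^2) transform (fine for tiny inputs)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        # Keeping only the magnitude drops the phase of each frequency bin.
        spectrum.append(abs(total))
    return spectrum


# A toy "audio" frame (an assumption for illustration): a decaying tone.
frame = [math.exp(-0.3 * t) * math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
reversed_frame = frame[::-1]  # the same samples played backwards

features = magnitude_spectrum(frame)
features_reversed = magnitude_spectrum(reversed_frame)

# The two inputs differ sample by sample...
same_signal = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(frame, reversed_frame)
)
# ...but their magnitude features agree up to floating-point error.
same_features = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(features, features_reversed)
)

print("signals identical:", same_signal)
print("features identical:", same_features)
```

A learning algorithm that sees only these features cannot tell the two inputs apart. That is harmless when the discarded detail is irrelevant to the transcript, and limiting when it is not.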

These hand-engineered components limit the potential performance of the speech system.
However, allowing hand-engineered components also has some advantages:

• The MFCC features are robust to some properties of speech that do not affect the content,
such as speaker pitch. Thus, they help simplify the problem for the learning algorithm.

• To the extent that phonemes are a reasonable representation of speech, they can also help
the learning algorithm understand basic sound components and therefore improve its
performance.

Having more hand-engineered components generally allows a speech system to learn with
less data. The hand-engineered knowledge captured by MFCCs and phonemes
“supplements” the knowledge our algorithm acquires from data. When we don’t have much
data, this knowledge is useful.

Now, consider the end-to-end system:

Page 95 Machine Learning Yearning-Draft Andrew Ng
