49 Pros and cons of end-to-end learning
Consider the same speech pipeline from our earlier example:
Many parts of this pipeline were “hand-engineered”:
• MFCCs are a set of hand-designed audio features. Although they provide a reasonable
summary of the audio input, they also simplify the input signal by throwing some
information away.
• Phonemes are an invention of linguists. They are an imperfect representation of speech
sounds. To the extent that phonemes are a poor approximation of reality, forcing an
algorithm to use a phoneme representation will limit the speech system’s performance.
These hand-engineered components limit the potential performance of the speech system.
However, allowing hand-engineered components also has some advantages:
• The MFCC features are robust to some properties of speech that do not affect the content,
such as speaker pitch. Thus, they help simplify the problem for the learning algorithm.
• To the extent that phonemes are a reasonable representation of speech, they can also help
the learning algorithm understand basic sound components and therefore improve its
performance.
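To make both sides of this trade-off concrete, here is a minimal sketch in Python. It does not compute MFCCs. As a much simpler stand-in, it uses the magnitude of a discrete Fourier transform as the hand-designed feature, and a circular time shift as the property that "does not affect the content". Both choices, and the names magnitude_spectrum and circular_shift, are assumptions made only for illustration. The sketch shows that two different inputs can map to the same features, which is robustness and information loss at the same time.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Toy hand-designed feature: DFT magnitudes, with the phase discarded."""
    n = len(signal)
    features = []
    for k in range(n):
        # Direct O(n^2) DFT; fine for a short illustrative signal.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        features.append(abs(coeff))
    return features


def circular_shift(signal, shift):
    """Rotate the signal to the right by `shift` samples (wrapping around)."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift]


# Hypothetical short "audio clip"; the values are arbitrary.
original = [0.0, 1.0, 0.5, -0.3, -1.0, 0.2, 0.7, -0.4]
shifted = circular_shift(original, 3)

features_original = magnitude_spectrum(original)
features_shifted = magnitude_spectrum(shifted)

# The raw inputs differ, but the hand-designed features agree.
inputs_differ = original != shifted
features_match = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(features_original, features_shifted)
)

print("inputs differ:", inputs_differ)
print("features match:", features_match)
```

A learning algorithm that sees only these features never has to learn that a shifted clip has the same content, but it also can never recover the shift if the shift turns out to matter.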
Having more hand-engineered components generally allows a speech system to learn with
less data. The hand-engineered knowledge captured by MFCCs and phonemes
“supplements” the knowledge our algorithm acquires from data. When we don’t have much
data, this knowledge is useful.
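The following toy sketch illustrates how built-in knowledge can substitute for data. It is not a speech system and not an experiment from this book. It assumes two made-up 8-sample "sounds", a 1-nearest-neighbor classifier trained on a single example per class, and circular autocorrelation as a hypothetical hand-engineered feature that ignores time shifts. All function and variable names are invented for the example.

```python
def circular_shift(signal, shift):
    """Rotate the signal to the right by `shift` samples (wrapping around)."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift]


def autocorrelation(signal):
    """Toy hand-engineered feature: circular autocorrelation (shift-invariant)."""
    n = len(signal)
    return [sum(signal[t] * signal[(t + lag) % n] for t in range(n))
            for lag in range(n)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_label(query, train, featurize):
    """1-nearest-neighbor prediction in the space defined by `featurize`."""
    q = featurize(query)
    best = min(train, key=lambda ex: squared_distance(q, featurize(ex[0])))
    return best[1]


def accuracy(test, train, featurize):
    hits = sum(nearest_label(x, train, featurize) == y for x, y in test)
    return hits / len(test)


# One training example per class: a very small dataset.
train = [
    ([1, 0, 0, 0, 0, 0, 0, 0], "A"),
    ([0, 0, 0, 0, 1, 1, 0, 0], "B"),
]

# Test set: every circular shift of each training signal, same label.
test = [(circular_shift(x, s), y) for x, y in train for s in range(8)]

raw_accuracy = accuracy(test, train, lambda x: x)          # no built-in knowledge
feature_accuracy = accuracy(test, train, autocorrelation)  # shift-invariance built in

print("raw input accuracy:", raw_accuracy)
print("hand-engineered feature accuracy:", feature_accuracy)
```

With the feature, two training examples are enough to label every shifted test clip correctly, because the invariance was supplied by hand rather than learned. On raw inputs, the same classifier misclassifies at least some shifted clips; it would need more training data to discover the invariance on its own.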
Now, consider the end-to-end system:
Page 95 Machine Learning Yearning-Draft Andrew Ng