
49 Pros and cons of end-to-end learning


             Consider the same speech pipeline from our earlier example:











             Many parts of this pipeline were “hand-engineered”:


             •   MFCCs are a set of hand-designed audio features. Although they provide a reasonable
                 summary of the audio input, they also simplify the input signal by throwing some
                 information away.


             •   Phonemes are an invention of linguists. They are an imperfect representation of speech
                 sounds. To the extent that phonemes are a poor approximation of reality, forcing an
                 algorithm to use a phoneme representation will limit the speech system’s performance.

             These hand-engineered components limit the potential performance of the speech system.
             However, allowing hand-engineered components also has some advantages:

             •   The MFCC features are robust to some properties of speech that do not affect the content,
                 such as speaker pitch. Thus, they help simplify the problem for the learning algorithm.

             •   To the extent that phonemes are a reasonable representation of speech, they can also help
                 the learning algorithm understand basic sound components and therefore improve its
                 performance.


             Having more hand-engineered components generally allows a speech system to learn with
             less data. The hand-engineered knowledge captured by MFCCs and phonemes
             “supplements” the knowledge our algorithm acquires from data. When we don’t have much
             data, this knowledge is useful.
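
To make this trade-off concrete, here is a minimal toy sketch. It is not a real speech system, and everything in it is a hypothetical stand-in: the three-number "recordings", the `normalize` feature, and the `nearest_neighbor` learner are all made up for illustration, and overall loudness plays the role that speaker pitch plays for MFCCs. It shows how a hand-designed feature that discards a nuisance property can let a learner generalize from only two training examples.

```python
# Toy illustration only. The data, feature, and learner are hypothetical
# stand-ins; loudness here plays the role of a nuisance property such as pitch.

def normalize(x):
    """Hand-engineered feature: rescale so the values sum to 1.

    This deliberately throws away overall loudness and keeps only the shape.
    """
    total = sum(x)
    return [v / total for v in x]


def nearest_neighbor(train, query, featurize):
    """Return the label of the training example closest to the query."""
    q = featurize(query)

    def squared_distance(example):
        features = featurize(example[0])
        return sum((a - b) ** 2 for a, b in zip(features, q))

    return min(train, key=squared_distance)[1]


# Only two training examples: "yes" recorded quietly, "no" recorded loudly.
train = [
    ([1.0, 2.0, 1.0], "yes"),
    ([4.0, 2.0, 4.0], "no"),
]

# A new "yes" with the same shape as the training one, but three times louder.
query = [3.0, 6.0, 3.0]

# Learner working on the raw input: loudness dominates the distance.
raw_prediction = nearest_neighbor(train, query, featurize=lambda x: x)

# Learner working on the hand-engineered feature: loudness is removed first.
engineered_prediction = nearest_neighbor(train, query, featurize=normalize)

print("raw input:        ", raw_prediction)
print("engineered feature:", engineered_prediction)
```

With so little data, the learner on raw inputs is misled by loudness and answers "no", while the learner on the normalized feature answers "yes". The same sketch also shows the cost: if loudness ever did carry content, `normalize` would have thrown that information away, and no amount of extra data could bring it back for this learner.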

             Now, consider the end-to-end system:













             Page 95                            Machine Learning Yearning-Draft                       Andrew Ng