End-to-end (E2E) Automatic Speech Recognition (ASR) models are trained using paired audio-text samples that are expensive to obtain, since high-quality ground-truth data requires human annotators. Voice search applications, such as digital media players, leverage ASR to allow users to search by voice rather than with an on-screen keyboard. However, recent or infrequent movie titles may not be sufficiently represented in the E2E ASR system’s training data and hence may suffer poor recognition.
In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model’s output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives and selects a final system output.
We find that our approach improves word error rate by between 4.4% and 7.6% relative on benchmarks of popular movie titles over a series of competitive baselines.
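To make the two-stage design concrete, the following is a minimal sketch of how a phonetic search followed by a rescoring step could fit together. It is an illustration under stated assumptions, not the system evaluated here: the toy lexicon `LEXICON`, the phoneme-level edit distance, the similarity normalization, and the confidence-comparison rule in `rescore` are all hypothetical stand-ins for the actual phonetic search and rescorer components.

```python
from typing import Dict, List, Tuple

# Hypothetical toy pronunciation lexicon (ARPAbet-like symbols).
# A real system would use a full lexicon or a grapheme-to-phoneme model.
LEXICON: Dict[str, List[str]] = {
    "dune": ["D", "UW", "N"],
    "doom": ["D", "UW", "M"],
    "tenet": ["T", "EH", "N", "AH", "T"],
}


def to_phonemes(text: str) -> List[str]:
    """Map text to a phoneme sequence; unknown words fall back to letters."""
    phones: List[str] = []
    for word in text.lower().split():
        phones.extend(LEXICON.get(word, list(word)))
    return phones


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def phonetic_search(hypothesis: str, catalog: List[str],
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Stage (a): rank catalog titles by phonetic similarity to the ASR output."""
    hyp = to_phonemes(hypothesis)
    scored = []
    for title in catalog:
        ref = to_phonemes(title)
        dist = edit_distance(hyp, ref)
        similarity = 1.0 - dist / max(len(hyp), len(ref), 1)
        scored.append((title, similarity))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rescore(hypothesis: str, asr_confidence: float,
            alternatives: List[Tuple[str, float]]) -> str:
    """Stage (b): choose between the ASR hypothesis and the alternatives.

    Assumed rule: keep the ASR output unless the best alternative's phonetic
    similarity exceeds the ASR confidence. A learned rescorer would replace this.
    """
    if not alternatives:
        return hypothesis
    best_title, best_similarity = alternatives[0]
    return best_title if best_similarity > asr_confidence else hypothesis


def correct(hypothesis: str, asr_confidence: float, catalog: List[str]) -> str:
    """Run phonetic search, then rescoring, and return the final output."""
    return rescore(hypothesis, asr_confidence,
                   phonetic_search(hypothesis, catalog))
```

In this sketch, calling `correct("doom", 0.4, ["dune", "tenet"])` replaces a low-confidence recognition with the phonetically closest catalog title, while a high-confidence recognition is left unchanged. Comparing a raw similarity against a raw confidence is a deliberate simplification; the two scores are not calibrated against each other.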
- ** Work done while at Apple
- † Meta