This paper was accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024.
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, so its computational complexity can impose severe practical limitations on the size of the biasing catalogue and, consequently, on the accuracy gains. This work proposes an approximation to cross-attention scoring based on vector quantization that enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique together with a retrieval-based contextual biasing approach. First, an efficient quantized retrieval module shortlists biasing entries by grounding them in the audio. Then, the retrieved entries are used for biasing. Since the proposed approach is agnostic to the biasing method, we experiment with full cross-attention, LLM prompting, and a combination of the two. We show that retrieval-based shortlisting allows the system to efficiently leverage biasing catalogues of several thousand entries, resulting in up to 71% relative error-rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95% for lists of up to one million entries, compared with standard dot-product cross-attention.
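The abstract does not specify implementation details, but the core idea — replacing per-entry dot-product cross-attention scores with scores looked up from a small vector-quantization codebook, then running exact scoring only on a retrieved shortlist — can be illustrated with a minimal NumPy sketch. All dimensions, the random embeddings, and the untrained codebook below are hypothetical placeholders, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: audio frames, a large biasing catalogue, embedding dim,
# and a small VQ codebook.
num_frames, num_entries, dim, num_codes = 50, 10_000, 64, 256

audio = rng.standard_normal((num_frames, dim))     # audio-frame encodings
entries = rng.standard_normal((num_entries, dim))  # biasing-entry embeddings

# Stand-in codebook; in practice the centroids would be learned/trained.
codebook = rng.standard_normal((num_codes, dim))
# Assign each entry to its best-matching code (dot-product nearest code).
codes = np.argmax(entries @ codebook.T, axis=1)    # (num_entries,)

# Approximate cross-attention: score audio against the codebook once, then
# look up each entry's score via its code. Compute is O(frames * codes)
# plus a cheap gather, instead of O(frames * entries) dot products.
code_scores = audio @ codebook.T                   # (num_frames, num_codes)
approx_scores = code_scores[:, codes]              # (num_frames, num_entries)

# Shortlist the top-k entries by their best score over frames, then do
# exact dot-product scoring only on that shortlist.
k = 100
shortlist = np.argsort(approx_scores.max(axis=0))[-k:]
exact_scores = audio @ entries[shortlist].T        # (num_frames, k)

print(exact_scores.shape)
```

The shortlist of `k` entries could then be fed to any downstream biasing method (full cross-attention or LLM prompting, per the abstract), since the shortlisting step is agnostic to how the retrieved entries are ultimately used.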

