Device-directed speech detection (DDSD) is the binary classification task of distinguishing a user's queries to a voice assistant (VA) from background speech or side conversations. This is crucial for achieving a naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from general representations of a large pre-trained ASR acoustic encoder (teacher). We apply task-specific adapters, on top of the (frozen) teacher encoder, trained jointly with the student model on DDSD. We demonstrate that the proposed adaptive KD outperforms the student model without distillation on both keyword and keyword-free (follow-up) invocations, with improvements of +26% and +19% in terms of Equal Error Rate, respectively. We also show that this approach generalizes across transformer- and conformer-based model architectures.
- † Meta
- ** Work done while at Apple
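
Below is a minimal sketch of the adaptive KD setup described in the abstract, under assumptions the abstract does not spell out: a frozen pre-trained ASR teacher encoder with a small trainable bottleneck adapter on top, a lightweight student DDSD classifier, and a joint objective combining binary cross-entropy on DDSD labels with a distillation term matching student and adapted-teacher representations. All module names, dimensions, and the choice of MSE as the distillation loss are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Hypothetical task-specific bottleneck adapter on top of the frozen teacher."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual adaptation of the teacher's general-purpose representations.
        return x + self.up(F.relu(self.down(x)))


class StudentDDSD(nn.Module):
    """Small student encoder + binary classifier for device-directed speech detection."""

    def __init__(self, in_dim: int, hid: int, teacher_dim: int):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid, batch_first=True)
        self.proj = nn.Linear(hid, teacher_dim)   # project into teacher space for KD
        self.classifier = nn.Linear(hid, 1)       # DDSD logit

    def forward(self, feats):
        h, _ = self.encoder(feats)                # (B, T, hid)
        pooled = h.mean(dim=1)                    # simple temporal pooling
        return self.classifier(pooled).squeeze(-1), self.proj(h)


def joint_loss(teacher_frozen, adapter, student, feats, labels, kd_weight=0.5):
    """BCE on DDSD labels + MSE distillation toward adapted teacher features.

    The teacher stays frozen; only the adapter and the student receive gradients,
    so the adapter is trained jointly with the student on the DDSD task.
    """
    with torch.no_grad():
        t_repr = teacher_frozen(feats)            # (B, T, teacher_dim), no gradients
    t_adapted = adapter(t_repr)                   # adapter is trainable
    logits, s_repr = student(feats)
    loss_task = F.binary_cross_entropy_with_logits(logits, labels.float())
    loss_kd = F.mse_loss(s_repr, t_adapted)
    return loss_task + kd_weight * loss_kd
```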

