Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Existing approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning by Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
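To make the cross-modal distillation idea concrete, the following is a minimal sketch of a temperature-scaled KL distillation loss, where a text-based teacher LLM (reading the transcript) supervises the speech-adapted student (reading the audio). The function names, the temperature value, and the use of plain KL divergence are illustrative assumptions, not the paper's actual implementation; real systems would compute this over model logits in a deep-learning framework.

```python
import math

def softmax(logits, temp=1.0):
    # Convert raw logits to a probability distribution, softened by `temp`.
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how much the student distribution q diverges from teacher p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_modal_distillation_loss(teacher_logits, student_logits, temp=2.0):
    # teacher_logits: text LLM's next-token logits on the transcript.
    # student_logits: speech-adapted LLM's logits on the matching audio.
    # The temp**2 factor is the standard rescaling for temperature-softened
    # distillation, keeping gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return (temp ** 2) * kl_divergence(p, q)
```

When the student's distribution matches the teacher's, the loss is zero; any cross-modal mismatch yields a positive penalty, which is what drives the speech branch toward the text branch's behavior.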
- † Université de Toulon, Aix Marseille Université, CNRS, LIS

