Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Existing approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning by Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
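To make the cross-modal distillation idea concrete, the following is a minimal sketch of a temperature-scaled KL distillation loss, where a text-based teacher LLM (reading the transcript) supervises the speech-adapted student (reading the audio). The function names, the temperature value, and the use of plain KL divergence are illustrative assumptions, not the paper's actual implementation; real systems would compute this over model logits in a deep-learning framework.

```python
import math

def softmax(logits, temp=1.0):
    # Convert raw logits to a probability distribution, softened by `temp`.
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how much the student distribution q diverges from teacher p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_modal_distillation_loss(teacher_logits, student_logits, temp=2.0):
    # teacher_logits: text LLM's next-token logits on the transcript.
    # student_logits: speech-adapted LLM's logits on the matching audio.
    # The temp**2 factor is the standard rescaling for temperature-softened
    # distillation, keeping gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return (temp ** 2) * kl_divergence(p, q)
```

When the student's distribution matches the teacher's, the loss is zero; any cross-modal mismatch yields a positive penalty, which is what drives the speech branch toward the text branch's behavior.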
- † Université de Toulon, Aix Marseille Université, CNRS, LIS

