    UK Tech Insider

    Data-Centric Lessons To Improve Speech-Language Pretraining

    By Oliver Chambers, December 28, 2025


    Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs. We focus on three research questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic pretraining datasets to augment web-crawled data, and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

    • † University of Cambridge
    • ‡ University of Tübingen