    Emerging Tech

Your AI models are failing in production: here's how to fix model selection

By Sophia Ahmed Wilson | June 4, 2025



Enterprises need to know whether the models that power their applications and agents work in real-life scenarios. That kind of evaluation can be complicated, because it is hard to predict the specific situations a model will face. A revamped version of the RewardBench benchmark aims to give organizations a better idea of a model's real-life performance.

The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it claims provides a more holistic view of model performance and assesses how well models align with an enterprise's goals and standards.

Ai2 built RewardBench around classification tasks that measure correlations with inference-time compute and downstream training. RewardBench primarily deals with reward models (RMs), which can act as judges and evaluate LLM outputs. RMs assign a score, or "reward," that guides reinforcement learning from human feedback (RLHF).
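In practice, a reward model is usually a language model with a scalar scoring head. The sketch below is illustrative rather than part of RewardBench itself: it shows how such a model can score prompt/response pairs with the Hugging Face transformers library. The model ID is a placeholder assumption; any sequence-classification reward model with a single output logit would work the same way.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder reward model ID (assumption); swap in whichever scalar-output RM you use.
RM_ID = "Skywork/Skywork-Reward-Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(RM_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    RM_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the RM assigns to a prompt/response pair."""
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(reward_model.device)
    with torch.no_grad():
        # The classification head emits a single logit, used directly as the reward.
        return reward_model(input_ids).logits[0][0].item()

prompt = "What is the capital of France?"
for candidate in ["Paris is the capital of France.", "The capital of France is Berlin."]:
    print(f"{score(prompt, candidate):+.3f}  {candidate}")
```

Scores like these are what RLHF pipelines feed back into training, and they are exactly what RewardBench 2 tries to stress-test.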

"RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling." pic.twitter.com/NGetvNrOQV

— Ai2 (@allen_ai) June 2, 2025

Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it launched. However, the model environment evolved quickly, and its benchmarks had to evolve with it.

"As reward models became more advanced and use cases more nuanced, we quickly recognized, together with the community, that the first version didn't fully capture the complexity of real-world human preferences," he said.

Lambert added that with RewardBench 2, "we set out to improve both the breadth and depth of evaluation, incorporating more diverse, challenging prompts and refining the methodology to better reflect how humans actually judge AI outputs in practice." He said the second version uses unseen human prompts, a more challenging scoring setup and new domains.

Using evaluations for the models that evaluate

While reward models test how well models work, it is also important that RMs align with company values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior such as hallucinations, reduce generalization and score harmful responses too highly.

RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.

"Enterprises should use RewardBench 2 in two different ways depending on their application. If they're performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines, because reward models need on-policy training recipes (i.e. reward models that mirror the model they're trying to train with RL). For inference-time scaling or data filtering, RewardBench 2 has shown that they can select the best model for their domain and see correlated performance," Lambert said.
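The two inference-time uses Lambert mentions, scaling and data filtering, both reduce to the same loop: score candidate outputs with the reward model and keep the best ones. Here is a minimal, hypothetical sketch of both, reusing the score() function from the earlier example; generate_candidates stands in for whatever generation client you already use and is an assumption, not a real API.

```python
def best_of_n(prompt, generate_candidates, score, n=8):
    """Best-of-N sampling: draw n responses, return the one the reward model ranks highest."""
    candidates = generate_candidates(prompt, n=n)      # assumed stand-in: returns a list of n strings
    rewards = [score(prompt, c) for c in candidates]   # one scalar reward per candidate
    return candidates[rewards.index(max(rewards))]

def filter_pairs(pairs, score, threshold=0.0):
    """Data filtering: keep only (prompt, response) pairs whose reward clears a threshold,
    e.g. before reusing them as fine-tuning data."""
    return [(p, r) for p, r in pairs if score(p, r) >= threshold]
```

RewardBench 2's claim is that a reward model that scores well on the relevant domain should transfer to exactly this kind of selection and filtering.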

Lambert noted that benchmarks like RewardBench offer users a way to evaluate the models they're choosing based on the "dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score." He said the notion of performance that many evaluation methods claim to assess is highly subjective, because a good response from a model depends heavily on the context and goals of the user, and human preferences are very nuanced.
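One way to act on that advice is to weight per-domain benchmark scores by what your application actually needs instead of ranking candidate reward models on a single aggregate number. The snippet below is purely illustrative; the weights and scores are made-up placeholders, not real RewardBench 2 results.

```python
def weighted_rank(per_domain_scores, weights):
    """Rank models by a weighted average of their per-domain scores."""
    total = sum(weights.values())
    ranked = [
        (model, sum(scores.get(d, 0.0) * w for d, w in weights.items()) / total)
        for model, scores in per_domain_scores.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical example: a support chatbot team might weight safety and
# instruction following heavily and math lightly.
weights = {"safety": 0.4, "precise_if": 0.3, "factuality": 0.2, "math": 0.1}
scores = {
    "reward-model-a": {"safety": 0.91, "precise_if": 0.78, "factuality": 0.85, "math": 0.60},
    "reward-model-b": {"safety": 0.83, "precise_if": 0.88, "factuality": 0.80, "math": 0.75},
}
print(weighted_rank(scores, weights))
```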

Ai2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta's FAIR came out with reWordBench, and DeepSeek released a new technique called Self-Principled Critique Tuning for smarter, more scalable RMs.

"Super excited that our second reward model evaluation is out. It's substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling.

Happy hillclimbing!

Huge congrats to @saumyamalik44 who led the project with a total commitment to excellence." https://t.co/c0b6rHTXY5

— Nathan Lambert (@natolambert) June 2, 2025

How the models performed

Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see whether they continue to rank highly. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1 and Llama-3.1, along with datasets and models like Qwen, Skywork and Ai2's own Tulu.

The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. For focus and safety, Skywork data "is particularly helpful," and Tulu did well on factuality.

Ai2 said that while it believes RewardBench 2 "is a step forward in broad, multi-domain accuracy-based evaluation" for reward models, it cautioned that model evaluation should be used primarily as a guide for picking the models that best fit an enterprise's needs.


