Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?

Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response. Such data can provide a feedback signal in domains where traditional hard-coded metrics are difficult to obtain (e.g., the quality of a chat interaction), thereby helping to measure model progress or to guide model fine-tuning (e.g., via reinforcement learning from human feedback, RLHF). However, for some domains it can be difficult to obtain high-quality pairwise comparisons, whether from humans or from AI. For example, long-form responses with many (potentially false) factual statements, or complex (potentially incorrect) code, pose significant challenges for both AI and human annotators. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging domains: long-form factual, math, and code tasks. We propose a tool-using agentic system that augments existing annotators to provide higher-quality feedback on these domains. Our system uses web search and code execution to ground its annotations in external validation, independent of the LLM's internal biases. We provide extensive experimental results evaluating our method across the three task domains as well as on out-of-domain tasks based on RewardBench subsets, where we aim to avoid performance reductions. We share all code to replicate the experiments as an open-source package.

* Work done while at Apple
† University of Cambridge
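
To make the idea of grounding a pairwise annotation in code execution concrete, here is a minimal Python sketch. It is not the authors' system: the function names (`count_passed`, `annotate_pair`), the test-case format, and the fallback annotator are all hypothetical, and the use of a bare `exec` without sandboxing is for illustration only. The sketch runs both candidate responses against shared test cases, prefers the one that passes more of them, and defers to an existing annotator when execution cannot separate the two.

```python
from typing import Callable, List, Tuple

# Hypothetical test-case format: (positional args, expected return value).
TestCase = Tuple[tuple, object]


def count_passed(code: str, func_name: str, tests: List[TestCase]) -> int:
    """Execute a response's code and count how many test cases it passes."""
    namespace: dict = {}
    try:
        # WARNING: unsandboxed execution; a real system would isolate this.
        exec(code, namespace)
        func = namespace[func_name]
    except Exception:
        return 0  # code that fails to load passes no tests
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing call counts as a failed test
    return passed


def annotate_pair(
    resp_a: str,
    resp_b: str,
    func_name: str,
    tests: List[TestCase],
    fallback: Callable[[str, str], str],
) -> str:
    """Return "A" or "B": prefer the response that passes more tests,
    and defer to the fallback annotator when execution is inconclusive."""
    score_a = count_passed(resp_a, func_name, tests)
    score_b = count_passed(resp_b, func_name, tests)
    if score_a != score_b:
        return "A" if score_a > score_b else "B"
    return fallback(resp_a, resp_b)


# Two candidate responses to "write add(x, y)"; the second has a bug.
good = "def add(x, y):\n    return x + y\n"
bad = "def add(x, y):\n    return x - y\n"
tests = [((1, 2), 3), ((2, 5), 7)]

# Stand-in for an existing LLM annotator (hypothetical): always picks "A".
baseline = lambda a, b: "A"

verdict = annotate_pair(bad, good, "add", tests, baseline)
```

Here the baseline would have chosen the buggy response in position A, while the execution-grounded check selects B. When both responses pass the same number of tests, the verdict falls back to the baseline, which mirrors the stated aim of avoiding performance reductions where the tools do not help.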
