Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response. Such data can provide a feedback signal in domains where traditional hard-coded metrics are difficult to obtain (e.g., the quality of chat interactions), thereby helping to measure model progress or to fine-tune models (e.g., via reinforcement learning from human feedback, RLHF). However, for some domains it can be difficult to obtain such pairwise comparisons at high quality, whether from humans or AI. For example, long-form responses with many (potentially false) factual statements or complex (potentially incorrect) code present significant challenges for both AI and human annotators. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging domains: long-form factual, math, and code tasks. We propose a tool-using agentic system that augments existing annotators to provide higher-quality feedback in these domains. Our system uses web search and code execution to ground its annotations in external validation, independent of the LLM's internal biases. We provide extensive experimental results evaluating our method across the three task domains as well as on out-of-domain tasks based on RewardBench subsets, where we aim to avoid performance regressions. We share all code to replicate the experiments as an open-source package.
- \* Work done while at Apple
- † University of Cambridge
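
As a rough illustration of the kind of tool-augmented annotator described in the abstract, the sketch below grounds a pairwise judgment in code execution: it runs the code block in each candidate response and passes the resulting outputs to a judge LLM as external evidence. This is a minimal sketch under stated assumptions, not the paper's implementation; `call_llm`, the prompt format, and the code-extraction heuristic are hypothetical placeholders.

```python
# Minimal sketch of a tool-augmented pairwise annotator (assumed design,
# not the paper's actual system): ground each response with external
# evidence via code execution, then ask a judge LLM to pick A or B.
import re
import subprocess


def call_llm(prompt: str) -> str:
    """Placeholder for a judge-LLM completion call (assumed interface)."""
    raise NotImplementedError("wire this to your LLM provider")


def run_extracted_code(response: str, timeout: float = 10.0) -> str:
    """Execute the first Python code block in a response; capture its output."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    if match is None:
        return "(no code block found)"
    try:
        result = subprocess.run(
            ["python", "-c", match.group(1)],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "(execution timed out)"


def judge_pair(task: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B', grounding the judgment in execution results."""
    evidence_a = run_extracted_code(response_a)
    evidence_b = run_extracted_code(response_b)
    judge_prompt = (
        f"Task:\n{task}\n\n"
        f"Response A:\n{response_a}\nExecution output A:\n{evidence_a}\n\n"
        f"Response B:\n{response_b}\nExecution output B:\n{evidence_b}\n\n"
        "Using the execution outputs as external evidence, answer with "
        "a single letter: which response is better, A or B?"
    )
    return call_llm(judge_prompt).strip()[:1].upper()
```

A web-search tool would slot into the same pattern for long-form factual tasks, with retrieved snippets taking the place of execution output as the grounding evidence.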