Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this, typically with fixed criteria such as "helpfulness" and "harmfulness". In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose "Reinforcement Learning from Checklist Feedback" (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item (using both AI judges and specialized verifier programs), then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction-following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks; RLCF is the only method to improve performance on every benchmark, including a 4-point increase in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models' support of queries that express a multitude of needs.
- † Carnegie Mellon University
- ‡ Meta
- ** Work done while at Apple
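
To make the reward computation concrete, here is a minimal Python sketch of checklist-based scoring as the abstract describes it: each checklist item is scored either by a verifier program or by an AI judge, and the per-item scores are combined into a scalar RL reward. This is not the authors' code; the names (`ChecklistItem`, `judge_score`, `checklist_reward`) and the simple averaging rule are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    """One requirement extracted from the user instruction."""
    description: str
    # Optional verifier program for items that can be checked mechanically
    # (e.g. word count, format constraints); assumed interface, not from the paper.
    verifier: Optional[Callable[[str], float]] = None


def judge_score(response: str, requirement: str) -> float:
    """Placeholder for an AI-judge call that rates how well `response`
    satisfies `requirement`, returning a score in [0, 1]."""
    raise NotImplementedError("query an LLM judge here")


def checklist_reward(response: str, checklist: list[ChecklistItem]) -> float:
    """Score each item with its verifier program when one exists, otherwise
    with the AI judge, then average the scores into a scalar RL reward.
    (Averaging is an assumed combination rule for illustration.)"""
    scores = []
    for item in checklist:
        if item.verifier is not None:
            scores.append(item.verifier(response))
        else:
            scores.append(judge_score(response, item.description))
    return sum(scores) / len(scores)
```

Under this sketch, a hybrid checklist might pair a programmatic item such as `ChecklistItem("under 100 words", verifier=lambda r: float(len(r.split()) < 100))` with judge-scored items for softer criteria like tone or completeness.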

