Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data by complexity, the right ordering for a particular model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (the Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continually adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
- † École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
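The teacher's role described above — tracking the student's observed success per question and preferring questions that are neither too easy nor too hard — can be sketched in a minimal form. This is an illustrative sketch only, not the paper's implementation; the class, method, and parameter names (`GoldilocksSampler`, `target`, `prior`, `lr`) are hypothetical, and a simple exponential moving average stands in for whatever difficulty predictor the teacher model actually uses.

```python
class GoldilocksSampler:
    """Toy teacher: tracks each question's estimated solve rate for the
    student and prefers questions whose rate is closest to a target,
    i.e., neither too easy (rate near 1) nor too hard (rate near 0)."""

    def __init__(self, question_ids, target=0.5, prior=0.5):
        # Start every question at a neutral prior solve rate.
        self.rates = {q: prior for q in question_ids}
        self.target = target

    def update(self, question_id, solved, lr=0.2):
        # Exponential moving average over the student's observed outcomes,
        # so the estimate adapts as the student's ability evolves.
        r = self.rates[question_id]
        self.rates[question_id] = (1 - lr) * r + lr * float(solved)

    def sample(self, k):
        # Return the k questions whose estimated solve rate is closest
        # to the target difficulty.
        ranked = sorted(self.rates, key=lambda q: abs(self.rates[q] - self.target))
        return ranked[:k]
```

In a training loop, each GRPO step would call `sample` to build the batch and `update` with the student's pass/fail outcomes, so questions the student has mastered (or cannot yet solve) drift out of the sampled set.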

