Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly improve the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capabilities in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.
- † Hasso Plattner Institute & ELLIS Unit Potsdam
- ** Work done while at Apple
- ‡ Equal contribution

