We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. Combining these verifiers with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, a new state of the art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
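The abstract names two mechanisms: task instances that bundle a codebase, runtime, unit tests, and a natural-language task, and verifiers that rerank sampled agent trajectories at inference time. Below is a minimal sketch of both, assuming a best-of-n scheme in which the verifier assigns each trajectory a scalar score and the highest-scoring one is kept. The `TaskInstance` and `Trajectory` records, their field names, the `best_of_n` helper, and the toy scores are all hypothetical illustrations, not the released SWE-Gym API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical record for one task instance (field names are assumptions)."""
    repo: str                   # codebase identifier
    runtime_image: str          # executable runtime environment, e.g. a container tag
    test_ids: Tuple[str, ...]   # unit tests used to check a candidate fix
    problem_statement: str      # the task, specified in natural language


@dataclass(frozen=True)
class Trajectory:
    """One sampled agent rollout that ends in a candidate patch."""
    steps: Tuple[str, ...]
    patch: str


def best_of_n(
    candidates: Sequence[Trajectory],
    verifier: Callable[[TaskInstance, Trajectory], float],
    task: TaskInstance,
) -> Trajectory:
    """Return the candidate with the highest verifier score.

    Python's max() keeps the first of several equal maxima, so ties
    resolve to the earliest-sampled trajectory.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=lambda traj: verifier(task, traj))


# Usage with a toy verifier that looks up fixed, made-up scores.
task = TaskInstance(
    repo="example/repo",
    runtime_image="example-image:latest",
    test_ids=("tests/test_parse.py::test_empty_input",),
    problem_statement="Fix the crash in parse() on empty input.",
)
candidates = [
    Trajectory(steps=("edit",), patch="patch-a"),
    Trajectory(steps=("edit", "run tests"), patch="patch-b"),
]
toy_scores = {"patch-a": 0.2, "patch-b": 0.9}
chosen = best_of_n(candidates, lambda t, traj: toy_scores[traj.patch], task)
print(chosen.patch)  # patch-b
```

Keeping the verifier as a plain callable separates trajectory sampling from scoring, so a trained model can replace the toy lookup without changing the selection logic.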
- * Equal contribution
- † University of California, Berkeley
- ‡ Work done while at Apple
- § University of Illinois Urbana-Champaign
- ¶ Carnegie Mellon University