Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Authors: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work studies improving LLMs reasoning ability through reinforcement learning on open software data. Therefore, theoretical results are not applicable here. Instead, we performed a comprehensive set of evaluations (§3) in an empirical fashion. |
| Researcher Affiliation | Collaboration | 1Meta AI 2University of Illinois Urbana-Champaign 3Carnegie Mellon University |
| Pseudocode | No | The paper describes methods through narrative text, diagrams (Figure 1, Figure 2), and mathematical equations (Equation 1, Equation 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/facebookresearch/swe-rl. We included the reward implementation and evaluation code in the supplemental material, along with detailed instructions for using the artifact to ensure transparency and reproducibility of our results. |
| Open Datasets | Yes | SWE-bench Verified (OpenAI, 2024), a human-verified collection of real-world GitHub issues. ... GHArchive (Grigorik, 2025), which contains all activity events data from GitHub. ... SWE-bench (Jimenez et al., 2023) a benchmark for solving real-world software issues |
| Dataset Splits | Yes | To prepare the initial dataset for RL, we extract 273k high-quality PR seeds from the raw PR dataset we collected from GitHub. ... We conduct evaluation on SWE-bench Verified (OpenAI, 2024), a subset of SWE-bench with 500 human-verified problems |
| Hardware Specification | Yes | We train our models on 512 NVIDIA H100 GPUs; a training run takes approximately 32 wall-time hours. |
| Software Dependencies | No | The paper mentions general tools and algorithms like 'Python's difflib.SequenceMatcher' and 'Group Relative Policy Optimization (GRPO)' but does not provide specific version numbers for these or any other software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Llama3-SWE-RL-70B is trained on top of Llama-3.3-70B-Instruct (Dubey et al., 2024) using SWE-RL for 1,600 steps with a 16k context window. We use a global batch size of 512, sampling 16 rollouts from each of the 32 problems in every batch. For every global step, a single optimization step is performed using Adam (Kingma and Ba, 2017). ... In the main evaluation (§3.2), we generate 500 patches for each problem using a 1.0 temperature, and use the top 30 reproduction tests for execution and reranking. |
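
The figures in the Hardware Specification and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant names and helper functions are hypothetical, the 32-hour figure is approximate per the table, and `patch_similarity` merely shows how Python's `difflib.SequenceMatcher` (the tool named in the Software Dependencies row) scores two strings. It is not presented as the paper's reward implementation.

```python
from difflib import SequenceMatcher

# Figures quoted in the table above.
ROLLOUTS_PER_PROBLEM = 16
PROBLEMS_PER_BATCH = 32
TRAINING_STEPS = 1_600
NUM_GPUS = 512
WALL_HOURS = 32  # approximate, per the Hardware Specification row


def global_batch_size() -> int:
    """Rollouts per global step: 16 rollouts for each of 32 problems."""
    return ROLLOUTS_PER_PROBLEM * PROBLEMS_PER_BATCH


def total_rollouts() -> int:
    """Rollouts over the whole run, assuming every step uses a full batch."""
    return global_batch_size() * TRAINING_STEPS


def gpu_hours() -> int:
    """Approximate GPU-hours for one training run."""
    return NUM_GPUS * WALL_HOURS


def patch_similarity(predicted: str, reference: str) -> float:
    """Hypothetical helper: similarity in [0, 1] between two patch strings."""
    return SequenceMatcher(None, predicted, reference).ratio()
```

Under these assumptions, 16 × 32 reproduces the stated global batch size of 512, and a run consumes roughly 512 × 32 GPU-hours.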