Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Authors: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work studies improving LLMs' reasoning ability through reinforcement learning on open software data. Therefore, theoretical results are not applicable here. Instead, we performed a comprehensive set of evaluations (§3) in an empirical fashion.
Researcher Affiliation | Collaboration | ¹Meta AI, ²University of Illinois Urbana-Champaign, ³Carnegie Mellon University
Pseudocode | No | The paper describes methods through narrative text, diagrams (Figure 1, Figure 2), and mathematical equations (Equation 1, Equation 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/facebookresearch/swe-rl. We included the reward implementation and evaluation code in the supplemental material, along with detailed instructions for using the artifact to ensure transparency and reproducibility of our results.
Open Datasets | Yes | SWE-bench Verified (OpenAI, 2024), a human-verified collection of real-world GitHub issues. ... GHArchive (Grigorik, 2025), which contains all activity events data from GitHub. ... SWE-bench (Jimenez et al., 2023), a benchmark for solving real-world software issues
Dataset Splits | Yes | To prepare the initial dataset for RL, we extract 273k high-quality PR seeds from the raw PR dataset we collected from GitHub. ... We conduct evaluation on SWE-bench Verified (OpenAI, 2024), a subset of SWE-bench with 500 human-verified problems
Hardware Specification | Yes | We train our models on 512 NVIDIA H100 GPUs; a training run takes approximately 32 wall-time hours.
Software Dependencies | No | The paper mentions general tools and algorithms like 'Python's difflib.SequenceMatcher' and 'Group Relative Policy Optimization (GRPO)' but does not provide specific version numbers for these or any other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Llama3-SWE-RL-70B is trained on top of Llama-3.3-70B-Instruct (Dubey et al., 2024) using SWE-RL for 1,600 steps with a 16k context window. We use a global batch size of 512, sampling 16 rollouts from each of the 32 problems in every batch. For every global step, a single optimization step is performed using Adam (Kingma and Ba, 2017). ... In the main evaluation (§3.2), we generate 500 patches for each problem using a 1.0 temperature, and use the top 30 reproduction tests for execution and reranking.
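
The figures quoted in the Hardware Specification and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes (the summary does not state this explicitly) that each of the 1,600 steps consumes one full batch, that the 512-GPU, 32-hour figure refers to that same run, and that 500 patches are generated for each of the 500 SWE-bench Verified problems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Numbers copied from the table above; the field names are hypothetical."""
    steps: int = 1_600
    problems_per_batch: int = 32
    rollouts_per_problem: int = 16
    global_batch_size: int = 512
    gpus: int = 512
    wall_hours: float = 32.0          # reported as approximate
    eval_problems: int = 500
    patches_per_problem: int = 500


def derive_totals(setup: ReportedSetup) -> dict:
    """Derive aggregate quantities implied by the reported setup."""
    rollouts_per_step = setup.problems_per_batch * setup.rollouts_per_problem
    # The batch composition should reproduce the reported global batch size.
    if rollouts_per_step != setup.global_batch_size:
        raise ValueError("batch composition does not match global batch size")
    return {
        "rollouts_per_step": rollouts_per_step,
        # Assumption: every step draws one full batch of rollouts.
        "total_rollouts": setup.steps * rollouts_per_step,
        # Assumption: all GPUs are in use for the whole wall-clock duration.
        "gpu_hours": setup.gpus * setup.wall_hours,
        # Assumption: every evaluation problem gets the full patch budget.
        "eval_patches": setup.eval_problems * setup.patches_per_problem,
    }


if __name__ == "__main__":
    for name, value in derive_totals(ReportedSetup()).items():
        print(f"{name}: {value:,}")
```

Under those assumptions, the reported numbers imply 16 × 32 = 512 rollouts per step, 819,200 training rollouts in total, roughly 16,384 H100 GPU-hours, and 250,000 generated evaluation patches.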