Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Repo2Run: Automated Building Executable Environment for Code Repository at Scale

Authors: Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, Cuiyun Gao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. We evaluate the effectiveness of Repo2Run on 420 Python code repositories.
Researcher Affiliation | Collaboration | Ruida Hu¹, Chao Peng², Xinchen Wang¹, Junjielong Xu², Cuiyun Gao¹; ¹Harbin Institute of Technology, Shenzhen; ²ByteDance, Beijing
Pseudocode | No | The paper describes the workflow of Repo2Run and its components using diagrams and descriptive text, but it does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.
Open Datasets | Yes | To demonstrate Repo2Run’s effectiveness, we first created a benchmark of 420 latest Python repositories with unit tests from GitHub in 2024... We provide the code and data in our anonymized repository.
Dataset Splits | Yes | To demonstrate Repo2Run’s effectiveness, we first created a benchmark of 420 latest Python repositories with unit tests from GitHub in 2024... We randomly sample 50 repositories from our benchmark, successfully run their entire test suites, and manually record the pass rates for each method.
Hardware Specification | No | The paper states: 'We specify the LLM used in the experiments and report the temperature settings applied during training and evaluation.' This justification specifies the LLM model used but does not provide any concrete details about the underlying hardware (e.g., GPU models, CPU types, memory) used for running the experiments or the Repo2Run system itself.
Software Dependencies | Yes | As the popular option, we select gpt-4o-2024-05-13 for all experiments, with the temperature uniformly set to 0.2. Based on the latest data [14] from 2025, we select Python 3.10 as the default Docker base image due to its broadest adoption among Python versions.
Experiment Setup | Yes | As the popular option, we select gpt-4o-2024-05-13 for all experiments, with the temperature uniformly set to 0.2.
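
The reported setup values can be gathered into a small record for anyone attempting a rerun. The following is a minimal sketch, not the authors' code: the class name `ExperimentSetup`, the helper `sample_repositories`, the placeholder repository IDs, and the random seed are all hypothetical. Only the model name, temperature, Python version, benchmark size, and sample size come from the quoted passages above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the table above; the field names are illustrative."""
    model: str = "gpt-4o-2024-05-13"  # LLM used for all experiments
    temperature: float = 0.2          # uniform temperature setting
    python_version: str = "3.10"      # default Docker base image version
    benchmark_size: int = 420         # Python repositories in the benchmark
    manual_sample_size: int = 50      # repositories checked by hand


def sample_repositories(repo_ids, k, seed=0):
    """Draw k distinct repositories without replacement.

    The seed is an assumption for repeatability; the paper excerpt
    does not state how the random sample was drawn.
    """
    rng = random.Random(seed)
    return rng.sample(list(repo_ids), k)


setup = ExperimentSetup()
# Hypothetical placeholder IDs standing in for the real benchmark entries.
repo_ids = [f"repo-{i:03d}" for i in range(setup.benchmark_size)]
subset = sample_repositories(repo_ids, setup.manual_sample_size)
print(len(subset), "repositories sampled for manual inspection")
```

Fixing the seed makes the 50-repository subset repeatable across runs, which is one way to narrow the gap left by the unreported hardware details.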