Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Repo2Run: Automated Building Executable Environment for Code Repository at Scale
Authors: Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, Cuiyun Gao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. We evaluate the effectiveness of Repo2Run on 420 Python code repositories. |
| Researcher Affiliation | Collaboration | Ruida Hu¹, Chao Peng², Xinchen Wang¹, Junjielong Xu², Cuiyun Gao¹; ¹Harbin Institute of Technology, Shenzhen; ²ByteDance, Beijing |
| Pseudocode | No | The paper describes the workflow of Repo2Run and its components using diagrams and descriptive text, but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run. |
| Open Datasets | Yes | To demonstrate Repo2Run’s effectiveness, we first created a benchmark of 420 latest Python repositories with unit tests from GitHub in 2024... We provide the code and data in our anonymized repository. |
| Dataset Splits | Yes | To demonstrate Repo2Run’s effectiveness, we first created a benchmark of 420 latest Python repositories with unit tests from GitHub in 2024... We randomly sample 50 repositories from our benchmark, successfully run their entire test suites, and manually record the pass rates for each method. |
| Hardware Specification | No | The paper states: 'We specify the LLM used in the experiments and report the temperature settings applied during training and evaluation.' This statement identifies the LLM used but does not provide any concrete details about the underlying hardware (e.g., GPU models, CPU types, memory) used for running the experiments or the Repo2Run system itself. |
| Software Dependencies | Yes | As the popular option, we select gpt-4o-2024-05-13 for all experiments, with the temperature uniformly set to 0.2. Based on the latest data [14] from 2025, we select Python 3.10 as the default Docker base image due to its broadest adoption among Python versions. |
| Experiment Setup | Yes | As the popular option, we select gpt-4o-2024-05-13 for all experiments, with the temperature uniformly set to 0.2. |
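
The settings quoted in the Software Dependencies and Experiment Setup rows can be collected into a single configuration record, which may help when trying to replicate the reported setup. The following is a minimal sketch, not the authors' code: the class name `ExperimentSetup`, its field names, and the `python:3.10` image tag are illustrative assumptions. Only the model identifier, the temperature, the Python version, and the two repository counts come from the table above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical record of the settings quoted in the table above."""

    llm_model: str = "gpt-4o-2024-05-13"  # quoted in the table
    temperature: float = 0.2              # quoted in the table
    python_version: str = "3.10"          # quoted default base image version
    benchmark_size: int = 420             # Python repositories with unit tests
    manual_check_sample: int = 50         # repositories sampled for pass-rate checks

    @property
    def docker_base_image(self) -> str:
        # Assumption: the official Docker Hub "python" image naming scheme.
        # The exact tag used by the authors is not given in the table.
        return f"python:{self.python_version}"


setup = ExperimentSetup()
print(asdict(setup))            # plain dict of the stored fields
print(setup.docker_base_image)  # derived image tag (assumed naming)
```

Freezing the dataclass keeps the recorded values from being changed accidentally during a run; the hardware used remains unspecified, as noted in the Hardware Specification row.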