R2E: Turning any Github Repository into a Programming Agent Environment

Authors: Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, Ion Stoica

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Building a scalable and interactive testbed for evaluating general-purpose AI programming agents for real-world code has been challenging, particularly due to a lack of high-quality test suites available. In this paper, we present Repository to Environment (R2E), a framework that can turn any GitHub repository into a test environment to evaluate the performance of code-generating systems, both static and interactive. Our results demonstrate that even when SOTA models cannot generate correct solutions with advanced prompting techniques, they can effectively use environment feedback, highlighting the need to move from static functional coding to an interactive programming paradigm.
Researcher Affiliation | Academia | Naman Jain*, Manish Shetty*, Tianjun Zhang, King Han, Koushik Sen, Ion Stoica (University of California, Berkeley).
Pseudocode | No | The paper includes code snippets as examples but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | R2E code is available at https://r2e.dev/
Open Datasets | Yes | Using this framework, we construct R2E-Eval1 (Section 4), the first large-scale benchmark of real-world coding problems consisting of natural-language docstrings, repository contexts, and equivalence test harnesses. R2E code is available at https://r2e.dev/
Dataset Splits | No | The paper describes the R2E-Eval1 benchmark as a collection of problems for evaluation but does not specify train, validation, or test splits for this dataset.
Hardware Specification | No | The paper mentions using Docker images and the size of installed repositories but does not specify any particular hardware components like GPU models, CPU types, or memory used for experiments.
Software Dependencies | No | The paper mentions tools like 'pdm', 'pip install', 'pipreqs', and 'PyCG' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | To compute Pass@1, we generate 5 completions for each problem instance using each model. We use nucleus sampling with p = 0.95 and T = 0.2. [...] We sample 56 and 48 instances from our benchmark for GPT-4 and GPT-3.5-Turbo on which the models do not generate a correct solution [...] We consider the incorrect programs generated by the models as the initial programs and then provide the models with error feedback using the harness iteratively for 5 iterations.
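The Open Datasets row above mentions the equivalence test harnesses that accompany each R2E-Eval1 problem. As a rough illustration of the idea only, and not code from the R2E repository, a candidate completion is accepted when it reproduces the behavior of the repository's reference function on a set of generated inputs. All identifiers below (equivalence_harness, reference_fn, candidate_fn) are hypothetical placeholders.

```python
from typing import Any, Callable, Iterable

def equivalence_harness(
    reference_fn: Callable[..., Any],   # ground-truth function from the repository
    candidate_fn: Callable[..., Any],   # LLM-generated implementation under test
    inputs: Iterable[tuple],            # generated argument tuples to exercise both functions
) -> bool:
    """Return True if the candidate matches the reference on every input.

    Hypothetical sketch of an equivalence test: run both functions on the same
    inputs and compare their outputs (or whether both raise an exception).
    """
    for args in inputs:
        try:
            expected = reference_fn(*args)
        except Exception:
            # Reference raises on this input; the candidate should raise as well.
            try:
                candidate_fn(*args)
            except Exception:
                continue
            return False
        try:
            actual = candidate_fn(*args)
        except Exception:
            return False
        if actual != expected:
            return False
    return True
```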
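The Experiment Setup row reports Pass@1 computed from 5 sampled completions per problem. The quoted excerpt does not restate the estimator, so the sketch below assumes the standard unbiased pass@k formula of Chen et al. (2021), which for k = 1 reduces to the fraction of samples that pass the harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem (5 in the paper's setup)
    c: completions that pass the equivalence test harness
    k: the k in pass@k (1 here)
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 of the 5 sampled completions pass the harness.
print(pass_at_k(n=5, c=2, k=1))  # 0.4; for k = 1 this is simply c / n
```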
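The same row describes feeding harness error output back to the model for up to 5 repair iterations. The loop below is one plausible shape of that procedure under assumed interfaces; model.complete and harness.run are invented placeholders, not R2E or model-provider APIs.

```python
def refine_with_feedback(model, problem, initial_program, harness, max_iters=5):
    """Iteratively repair an incorrect program using execution feedback.

    Hypothetical sketch: `harness.run` is assumed to return (passed, error_log)
    and `model.complete` to return a revised program string.
    """
    program = initial_program
    for _ in range(max_iters):
        passed, error_log = harness.run(program)
        if passed:
            return program, True
        prompt = (
            f"{problem}\n\n"
            f"Previous attempt:\n{program}\n\n"
            f"Execution feedback from the test harness:\n{error_log}\n\n"
            "Revise the code to fix the reported errors."
        )
        program = model.complete(prompt)
    return program, False
```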