R2E: Turning any Github Repository into a Programming Agent Environment
Authors: Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, Ion Stoica
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Building a scalable and interactive testbed for evaluating general-purpose AI programming agents for real-world code has been challenging, particularly due to a lack of high-quality test suites available. In this paper, we present Repository to Environment (R2E), a framework that can turn any GITHUB repository into a test environment to evaluate the performance of code-generating systems, both static and interactive. Our results demonstrate that even when SOTA models cannot generate correct solutions with advanced prompting techniques, they can effectively use environment feedback highlighting the need to move from static functional coding to interactive programming paradigm. |
| Researcher Affiliation | Academia | Naman Jain * 1 Manish Shetty * 1 Tianjun Zhang 1 King Han 1 Koushik Sen 1 Ion Stoica 1 1University of California, Berkeley. |
| Pseudocode | No | The paper includes code snippets for examples but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | R2E code is available at https://r2e.dev/ |
| Open Datasets | Yes | Using this framework, we construct R2E-Eval1 (Section 4), the first large-scale benchmark of real-world coding problems consisting of natural-language docstrings, repository contexts, and equivalence test harnesses. R2E code is available at https://r2e.dev/ |
| Dataset Splits | No | The paper describes the R2E-Eval1 benchmark as a collection of problems for evaluation but does not specify train, validation, or test splits for this dataset. |
| Hardware Specification | No | The paper mentions using Docker images and the size of installed repositories but does not specify any particular hardware components like GPU models, CPU types, or memory used for experiments. |
| Software Dependencies | No | The paper mentions tools like 'pdm', 'pip install', 'pipreqs', and 'PYCG' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | To compute PASS@1, we generate 5 completions for each problem instance using each model. We use nucleus sampling with p = 0.95 and T = 0.2. [...] We sample 56 and 48 instances from our benchmark for GPT-4 and GPT-3.5-TURBO on which the models do not generate a correct solution [...] We consider the incorrect programs generated by the models as the initial programs and then provide the models with error feedback using the harness iteratively for 5 iterations. |
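
The Experiment Setup row quotes the paper's decoding settings (nucleus sampling with p = 0.95, T = 0.2) and PASS@1 estimated from 5 completions per problem. As a rough illustration only, the sketch below applies the standard unbiased pass@k estimator (Chen et al., 2021) with n = 5 and k = 1; the `generate` and `run_harness` callables are hypothetical placeholders standing in for the model API and R2E's equivalence-test harness, not part of the R2E release.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem
    c: completions that pass the test harness
    k: evaluation budget
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def estimate_pass_at_1(problems, generate, run_harness, n: int = 5) -> float:
    """Average pass@1 over a set of problems, mirroring the quoted setup.

    `generate` and `run_harness` are assumed interfaces: `generate` samples one
    completion with the paper's decoding settings, `run_harness` returns True
    if the completion passes the problem's equivalence tests.
    """
    scores = []
    for problem in problems:
        # 5 samples per problem, nucleus sampling with p=0.95 and T=0.2 as quoted above.
        completions = [generate(problem, top_p=0.95, temperature=0.2) for _ in range(n)]
        c = sum(run_harness(problem, code) for code in completions)
        scores.append(pass_at_k(n, c, k=1))
    return sum(scores) / len(scores)
```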