RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Authors: Tianyang Liu, Canwen Xu, Julian McAuley
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. ... We conduct a series of experiments on RepoBench, analyzing the efficacy of various retrieval methods and code completion models of different magnitudes, and assessing their combined performance in a full pipeline, providing insights for future research and development. |
| Researcher Affiliation | Academia | Tianyang Liu, Canwen Xu, Julian McAuley. University of California San Diego. {til040, cxu, jmcauley}@ucsd.edu |
| Pseudocode | No | The paper describes the methodology and tasks in detail, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | RepoBench is actively maintained with the latest code, serving as a live benchmark publicly available at https://github.com/Leolty/repobench. |
| Open Datasets | Yes | Github-Code Dataset: The first source of RepoBench is the github-code dataset [2], which consists of a vast collection of code files sourced from GitHub repositories under open-source licenses with a data cutoff date of March 16, 2022. ... Footnote 2: https://huggingface.co/datasets/codeparrot/github-code |
| Dataset Splits | No | The paper describes the construction of training and test datasets for the RepoBench benchmark (Table 1 and Table 6) and mentions settings like XF-F, XF-R, and IF. However, it does not explicitly specify a distinct validation set split or its size/percentage for the experiments conducted within this paper. |
| Hardware Specification | No | The paper states, "Due to limited experimental resources and the extensive scale of our experiments, we rely on quantized models and libraries known for fast inference speeds," but does not provide specific details on the hardware (e.g., GPU models, CPU specifications, or memory) used for running the experiments. |
| Software Dependencies | No | All models (except Codex) use CTranslate2 (OpenNMT, 2023) for inference [7] and the model weights are sourced from Huggingface (Wolf et al., 2020). While it names software, it does not specify version numbers for CTranslate2, Huggingface libraries, Python, or other key dependencies needed for replication. (See the inference sketch after this table.) |
| Experiment Setup | Yes | During inference for new token generation, all models are set with a temperature of 0.2 and a top_p of 0.95, generating 64 tokens per next-line prediction, with the first non-comment line truncated as the output. ... The in-file context includes import statements and several preceding lines of code with a maximum limit of 30 lines. ... RepoBench-C-2k... holds prompts that do not exceed 1,925 tokens. Concurrently, RepoBench-C-8k is architected with a higher threshold, encompassing up to 7,685 tokens. ... We reserve 1,600 tokens for the in-file context, with a cropping limit of 60 preceding lines. Any unused tokens from this allocation are then filled by the cross-file context, up to a total prompt size of 6,400 tokens. (See the prompt-construction sketch after this table.) |
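The token budget quoted in the Experiment Setup row can be made concrete with a short sketch. This is a minimal illustration under the RepoBench-C-8k setting described above (1,600 tokens of in-file context cropped to at most 60 preceding lines, with cross-file context filling the remainder up to 6,400 prompt tokens), not the authors' released code: the helper names and tokenizer checkpoint are placeholders.

```python
# Hypothetical prompt budgeting for the RepoBench-C-8k setting: up to 1,600
# tokens of in-file context (imports plus at most 60 preceding lines), with
# cross-file context filling the rest up to 6,400 tokens total.
# Helper and checkpoint names are placeholders, not from the paper's code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")  # placeholder checkpoint

IN_FILE_BUDGET = 1_600      # tokens reserved for the in-file context
TOTAL_BUDGET = 6_400        # maximum prompt size in tokens
MAX_PRECEDING_LINES = 60    # cropping limit on preceding lines

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))

def truncate_to_budget(text: str, budget: int) -> str:
    # Drop leading lines until the text fits within the token budget.
    lines = text.splitlines()
    while lines and count_tokens("\n".join(lines)) > budget:
        lines.pop(0)
    return "\n".join(lines)

def build_prompt(imports: str, preceding_lines: list[str], cross_file_context: str) -> str:
    # In-file context: imports plus at most 60 preceding lines, capped at 1,600 tokens.
    in_file = imports + "\n" + "\n".join(preceding_lines[-MAX_PRECEDING_LINES:])
    in_file = truncate_to_budget(in_file, IN_FILE_BUDGET)
    # Unused tokens from the in-file allocation are given to the cross-file context.
    remaining = TOTAL_BUDGET - count_tokens(in_file)
    cross_file = truncate_to_budget(cross_file_context, remaining)
    return cross_file + "\n" + in_file
```

How the budget is split between in-file and cross-file context (e.g., dropping leading vs. trailing lines) is an assumption here; the paper only specifies the limits themselves.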
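The decoding configuration (temperature 0.2, top_p 0.95, 64 new tokens, output truncated to the first non-comment line) can likewise be sketched. The paper runs quantized models through CTranslate2; since exact versions and conversion steps are not reported, the snippet below substitutes the standard Hugging Face transformers generate API as a stand-in with the same decoding settings. The checkpoint name is a placeholder.

```python
# Hedged stand-in for the paper's inference loop: the paper uses CTranslate2
# with quantized weights, but this sketch uses the plain transformers API with
# the same decoding settings (temperature 0.2, top_p 0.95, 64 new tokens).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-2B-mono"  # placeholder; not tied to the paper's exact setup
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

def predict_next_line(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Keep only the first non-comment line of the completion, per the setup quote.
    for line in completion.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return line
    return ""
```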