Repair Is Nearly Generation: Multilingual Program Repair with LLMs
Authors: Harshit Joshi, José Cambronero Sanchez, Sumit Gulwani, Vu Le, Gust Verbruggen, Ivan Radiček
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the first results for such a multilingual repair engine by evaluating on 6 different languages and comparing performance to language-specific repair engines. We perform an extensive evaluation across six different languages, showing that multilingual repair with LLMCs is viable and can compete with or outperform language-specific repair engines. |
| Researcher Affiliation | Industry | Harshit Joshi¹, José Cambronero Sanchez²*, Sumit Gulwani²*, Vu Le²*, Ivan Radiček³*, Gust Verbruggen⁴* — ¹ Microsoft, India; ² Microsoft, USA; ³ Microsoft, Croatia; ⁴ Microsoft, Belgium. {t-hjoshi, jcambronero, sumitg, levu, ivradice, gverbruggen}@microsoft.com |
| Pseudocode | No | The paper describes its approach conceptually and visually (Figure 1) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions releasing a benchmark set for PowerShell at 'https://github.com/microsoft/prose-benchmarks/' but does not provide a link to, or statement about, the source code of the RING methodology itself. |
| Open Datasets | Yes | Excel: We use a recently released dataset of 200 Excel repair tasks collected from Excel help forums (Bavishi et al. 2022). Python: We evaluate RING on a random sample of 200 syntactically invalid Python code snippets from the dataset used by the SOTA syntax repair tool for Python: BIFI (Yasunaga and Liang 2021). We introduce PowerShell commands as a new application for last-mile repair and collect a benchmark set of 200 PowerShell commands from Stack Overflow, which we also release for future research¹. (¹ https://github.com/microsoft/prose-benchmarks/) |
| Dataset Splits | Yes | Smart selection is done via leave-one-out. For languages with ground truth, all other tasks form the example bank for drawing shots. Since the C and Python datasets do not have ground-truth pairs, we sample an additional 400 programs from their corresponding datasets. We run the best RING configuration (without smart selection) on these 400 programs and pick those that do not raise any diagnostic errors. These buggy/correct pairs form the example bank in C and Python. (A minimal sketch of this example-bank setup appears below the table.) |
| Hardware Specification | No | The paper states 'We ran all Codex-related queries on August 9th 2022 using OpenAI's public API for code-davinci-002, with the exception of PowerShell experiments which we ran on March 7th 2023.' but does not specify any particular hardware specifications (GPU/CPU models, memory, etc.). |
| Software Dependencies | No | The paper mentions software components like 'OpenAI's public API for code-davinci-002', 'Pygments lexer', 'ESLint', and 'gcc', but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | All RING experiments use a temperature of 0.7. For all experiments, we used `###` as the stop token and top_p = 1.0. (A hedged sketch of this decoding configuration appears below the table.) |
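
The leave-one-out example-bank construction described in the Dataset Splits row can be sketched as follows. This is a minimal illustration, assuming a bank of (buggy, repaired) pairs; the `token_overlap` scoring function and the shot count `k` are illustrative placeholders, not the paper's actual smart-selection criterion.

```python
# Minimal sketch of the leave-one-out example-bank setup described above.
# For languages with ground truth, every other (buggy, fixed) pair is a
# candidate shot for the current task. The similarity score below is a
# simple token-overlap measure chosen for illustration only.
from typing import List, Tuple

Pair = Tuple[str, str]  # (buggy program, repaired program)

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (illustrative only)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def select_shots(task_idx: int, bank: List[Pair], k: int = 4) -> List[Pair]:
    """Leave-one-out: rank all *other* pairs by similarity to the buggy input."""
    buggy, _ = bank[task_idx]
    candidates = [p for i, p in enumerate(bank) if i != task_idx]
    candidates.sort(key=lambda p: token_overlap(buggy, p[0]), reverse=True)
    return candidates[:k]
```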
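The decoding configuration in the Experiment Setup row (temperature 0.7, top_p 1.0, stop token `###`, model code-davinci-002) maps onto an OpenAI completion call roughly as below. This sketch assumes the pre-1.0 `openai` Python client that was current at the time; the prompt construction, `n_samples`, and `max_tokens` values are assumptions for illustration and are not reported in this section.

```python
# Illustrative sketch of the reported decoding configuration
# (temperature 0.7, top_p 1.0, stop token "###", model code-davinci-002).
# Assumes the pre-1.0 `openai` Python client; prompt and max_tokens are
# placeholders, not values taken from the paper.
import openai

def query_codex(prompt: str, n_samples: int = 1) -> list:
    """Send a few-shot repair prompt to Codex and return candidate fixes."""
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        temperature=0.7,   # sampling temperature used for all RING experiments
        top_p=1.0,         # nucleus sampling effectively disabled
        stop="###",        # delimiter separating few-shot examples
        n=n_samples,
        max_tokens=256,    # placeholder budget, not specified in the section
    )
    return [choice.text for choice in response.choices]
```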