Repair Is Nearly Generation: Multilingual Program Repair with LLMs

Authors: Harshit Joshi, José Cambronero Sanchez, Sumit Gulwani, Vu Le, Gust Verbruggen, Ivan Radiček

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the first results for such a multilingual repair engine by evaluating on 6 different languages and comparing performance to language-specific repair engines. We perform an extensive evaluation across six different languages, showing that multilingual repair with LLMCs is viable and can compete with or outperform language-specific repair engines.
Researcher Affiliation | Industry | Harshit Joshi (1), José Cambronero Sanchez (2*), Sumit Gulwani (2*), Vu Le (2*), Ivan Radiček (3*), Gust Verbruggen (4*); 1: Microsoft, India; 2: Microsoft, USA; 3: Microsoft, Croatia; 4: Microsoft, Belgium. {t-hjoshi, jcambronero, sumitg, levu, ivradice, gverbruggen}@microsoft.com
Pseudocode | No | The paper describes its approach conceptually and visually (Figure 1) but does not provide any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions releasing a benchmark set for PowerShell at https://github.com/microsoft/prose-benchmarks/ but does not provide a link or statement for the source code of the RING methodology itself.
Open Datasets | Yes | Excel: "We use a recently released dataset of 200 Excel repair tasks collected from Excel help forums (Bavishi et al. 2022)." Python: "We evaluate RING on a random sample of 200 syntactically invalid Python code snippets from the dataset used by the SOTA syntax repair tool for Python: BIFI (Yasunaga and Liang 2021)." PowerShell: "We introduce PowerShell commands as a new application for last-mile repair and collect a benchmark set of 200 PowerShell commands from Stack Overflow, which we also release for future research" (https://github.com/microsoft/prose-benchmarks/).
Dataset Splits | Yes | Smart selection is done via leave-one-out: for languages with ground truth, all other tasks form the example bank for drawing shots. Since the C and Python datasets do not have ground-truth pairs, we sample an additional 400 programs from their corresponding datasets, run the best RING configuration (without smart selection) on them, and keep those that do not raise any diagnostic errors. These buggy/correct pairs form the example bank for C and Python. (A rough sketch of this example-bank construction appears after the table.)
Hardware Specification | No | The paper states "We ran all Codex-related queries on August 9th, 2022 using OpenAI's public API for code-davinci-002, with the exception of PowerShell experiments, which we ran on March 7th, 2023" but does not specify any particular hardware (GPU/CPU models, memory, etc.).
Software Dependencies | No | The paper mentions software components such as OpenAI's public API for code-davinci-002, the Pygments lexer, ESLint, and gcc, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "All RING experiments are at 0.7 temperature. For all the experiments, we used ### as stop token and top_p = 1.0." (A sketch of a completion request with these parameters appears after the table.)
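
The example-bank construction reported under Dataset Splits can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: run_ring and get_diagnostics are hypothetical placeholders for the repair call and the language-specific diagnostic tool, and the leave-one-out selection is shown only in outline.

```python
# Hypothetical sketch of the example-bank construction described under Dataset Splits.
# run_ring() and get_diagnostics() are placeholders, not the authors' API.

def build_example_bank(extra_programs, run_ring, get_diagnostics):
    """For C/Python (no ground-truth pairs): repair the extra sampled programs
    with the best RING configuration and keep only pairs whose proposed fix
    raises no diagnostics."""
    bank = []
    for buggy in extra_programs:            # e.g. the 400 sampled programs
        candidate = run_ring(buggy)         # best config, no smart selection
        if not get_diagnostics(candidate):  # empty result => no errors reported
            bank.append((buggy, candidate)) # (buggy, fixed) few-shot pair
    return bank


def leave_one_out_bank(tasks, held_out_index):
    """For languages with ground truth: every task except the one currently
    being evaluated is eligible as a few-shot example (leave-one-out)."""
    return [t for i, t in enumerate(tasks) if i != held_out_index]
```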
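
The decoding parameters under Experiment Setup correspond, roughly, to a request like the one below. This is a sketch against the legacy OpenAI Completion API that code-davinci-002 was served through; the model has since been retired and the prompt shown is an invented placeholder, so treat this as documentation of the reported settings rather than a runnable reproduction.

```python
import openai  # legacy (pre-1.0) OpenAI SDK assumed; openai.api_key must be configured

# Placeholder prompt; the paper's actual prompt format is not reproduced here.
prompt = (
    "### Buggy program\n"
    "print('hello'\n"
    "### Fixed program\n"
)

response = openai.Completion.create(
    engine="code-davinci-002",  # Codex model reported in the paper
    prompt=prompt,
    temperature=0.7,            # reported sampling temperature
    top_p=1.0,                  # reported nucleus-sampling value
    stop="###",                 # reported stop token
    max_tokens=256,             # not reported; illustrative value
)
print(response["choices"][0]["text"])
```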