Repair Is Nearly Generation: Multilingual Program Repair with LLMs

Authors: Harshit Joshi, José Cambronero Sanchez, Sumit Gulwani, Vu Le, Gust Verbruggen, Ivan Radiček

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the first results for such a multilingual repair engine by evaluating on 6 different languages and comparing performance to language-specific repair engines. We perform an extensive evaluation across six different languages, showing that multilingual repair with LLMCs is viable and can compete with or outperform language-specific repair engines.
Researcher Affiliation | Industry | Harshit Joshi (1), José Cambronero Sanchez (2*), Sumit Gulwani (2*), Vu Le (2*), Ivan Radiček (3*), Gust Verbruggen (4*); 1: Microsoft, India; 2: Microsoft, USA; 3: Microsoft, Croatia; 4: Microsoft, Belgium. {t-hjoshi, jcambronero, sumitg, levu, ivradice, gverbruggen}@microsoft.com
Pseudocode | No | The paper describes its approach conceptually and visually (Figure 1) but does not provide any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions releasing a benchmark set for PowerShell at https://github.com/microsoft/prose-benchmarks/ but does not provide a link or statement for the source code of the RING methodology itself.
Open Datasets | Yes | Excel: "We use a recently released dataset of 200 Excel repair tasks collected from Excel help forums (Bavishi et al. 2022)." Python: "We evaluate RING on a random sample of 200 syntactically invalid Python code snippets from the dataset used by the SOTA syntax repair tool for Python: BIFI (Yasunaga and Liang 2021)." PowerShell: "We introduce PowerShell commands as a new application for last-mile repair and collect a benchmark set of 200 PowerShell commands from Stack Overflow, which we also release for future research" (https://github.com/microsoft/prose-benchmarks/).
Dataset Splits | Yes | Smart selection is done via leave-one-out: for languages with ground truth, all other tasks form the example bank for drawing shots. Since the C and Python datasets do not have ground-truth pairs, we sample an additional 400 programs from their corresponding datasets, run the best RING configuration (without smart selection) on them, and keep those that do not raise any diagnostic errors. These buggy/correct pairs form the example bank for C and Python. (A rough sketch of this example-bank construction appears after the table.)
Hardware Specification | No | The paper states "We ran all Codex-related queries on August 9th, 2022 using OpenAI's public API for code-davinci-002, with the exception of PowerShell experiments, which we ran on March 7th, 2023" but does not specify any particular hardware (GPU/CPU models, memory, etc.).
Software Dependencies | No | The paper mentions software components such as OpenAI's public API for code-davinci-002, the Pygments lexer, ESLint, and gcc, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "All RING experiments are at 0.7 temperature. For all the experiments, we used ### as stop token and top_p = 1.0." (A sketch of a completion request with these parameters appears after the table.)
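
The example-bank construction reported under Dataset Splits can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: run_ring and get_diagnostics are hypothetical placeholders for the repair call and the language-specific diagnostic tool, and the leave-one-out selection is shown only in outline.

```python
# Hypothetical sketch of the example-bank construction described under Dataset Splits.
# run_ring() and get_diagnostics() are placeholders, not the authors' API.

def build_example_bank(extra_programs, run_ring, get_diagnostics):
    """For C/Python (no ground-truth pairs): repair the extra sampled programs
    with the best RING configuration and keep only pairs whose proposed fix
    raises no diagnostics."""
    bank = []
    for buggy in extra_programs:            # e.g. the 400 sampled programs
        candidate = run_ring(buggy)         # best config, no smart selection
        if not get_diagnostics(candidate):  # empty result => no errors reported
            bank.append((buggy, candidate)) # (buggy, fixed) few-shot pair
    return bank


def leave_one_out_bank(tasks, held_out_index):
    """For languages with ground truth: every task except the one currently
    being evaluated is eligible as a few-shot example (leave-one-out)."""
    return [t for i, t in enumerate(tasks) if i != held_out_index]
```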
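
The decoding parameters under Experiment Setup correspond, roughly, to a request like the one below. This is a sketch against the legacy OpenAI Completion API that code-davinci-002 was served through; the model has since been retired and the prompt shown is an invented placeholder, so treat this as documentation of the reported settings rather than a runnable reproduction.

```python
import openai  # legacy (pre-1.0) OpenAI SDK assumed; openai.api_key must be configured

# Placeholder prompt; the paper's actual prompt format is not reproduced here.
prompt = (
    "### Buggy program\n"
    "print('hello'\n"
    "### Fixed program\n"
)

response = openai.Completion.create(
    engine="code-davinci-002",  # Codex model reported in the paper
    prompt=prompt,
    temperature=0.7,            # reported sampling temperature
    top_p=1.0,                  # reported nucleus-sampling value
    stop="###",                 # reported stop token
    max_tokens=256,             # not reported; illustrative value
)
print(response["choices"][0]["text"])
```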