Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming

Authors: Hussein Mozannar, Gagan Bansal, Adam Fourney, Eric Horvitz

AAAI 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Using data from 535 programmers, we perform a retrospective evaluation of CDHF and show that we can avoid displaying a significant fraction of suggestions that would have been rejected.
Researcher Affiliation	Collaboration	¹Massachusetts Institute of Technology ²Microsoft Research EMAIL
Pseudocode	No	The paper does not contain any sections or blocks explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code	Yes	Code is available² and additional details can be found in the appendix. (footnote 2: https://github.com/microsoft/coderec_programming_states)
Open Datasets	No	To build and evaluate our methods, we extract a large number of telemetry logs from Copilot users (mostly software engineers and researchers) at Microsoft. Programmers provided consent for the use of their data, and its use was approved by Microsoft's ethics advisory board.
Dataset Splits	Yes	We split the telemetry dataset in a 70:10:20 split for training, validation, and testing respectively.
Hardware Specification	No	The time to compute the features needed for the models and performing inference on a single data point can take 10ms with a GPU and less than 1ms on a CPU when omitting embeddings, in addition to latency of sending and receiving information between server and client. This mentions general 'GPU' and 'CPU' but does not provide specific model numbers or other detailed hardware specifications.
Software Dependencies	No	The paper mentions software like eXtreme Gradient Boosting (XGB), CodeBERT, and Tree-sitter Parser, but does not specify their version numbers, nor any programming language versions or other libraries with specific version details.
Experiment Setup	Yes	We set the thresholds t1, t2, tr on the validation set for CDHF and evaluate on the test set. ...Our proposed approach is as follows: Each time the programmer pauses typing, we decide using a predictor whether to show a suggestion. Crucially, we do this using a two-stage scheme...
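
The 70:10:20 split reported under Dataset Splits can be illustrated with a minimal sketch. The quoted text does not say whether the split was made per event or per programmer, nor which random seed was used, so the function name `split_70_10_20`, the `seed` parameter, and the per-record shuffling below are assumptions for illustration only, not the authors' procedure.

```python
import random


def split_70_10_20(records, seed=0):
    """Shuffle records and split them 70:10:20 into train/validation/test.

    Hypothetical helper: per-record shuffling and the seed are assumptions,
    not details taken from the paper.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed

    n = len(shuffled)
    n_train = n * 7 // 10  # integer arithmetic avoids float rounding
    n_val = n // 10

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 20%
    return train, val, test


train, val, test = split_70_10_20(range(1000))
print(len(train), len(val), len(test))
```

Under this sketch, thresholds such as t1, t2, tr would be tuned on `val` only and the final numbers reported on `test`, which matches the setup quoted in the Experiment Setup row.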