R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents

Authors: Daniel D. Johnson, Daniel Tarlow, Christian Walder

ICML 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our approach by applying it to three developer assistance tasks, each of which is visualized in Figure 1. For all tasks, we generate suggestion prototypes and hypothetical intents using a 5B-parameter decoder-only LM trained on 105B tokens of permissively-licensed open-source code from GitHub, and parse them into trees using an error-tolerant bracket-matching pseudo-parser (described in Appendix D). We compare our approach to a number of task-specific baselines, all of which build suggestions s ∈ S(g⁽¹⁾) from the same suggestion prototype, and evaluate how well each method can predict the changes necessary to obtain the final code state from the dataset, measured by our utility function as well as token accuracy.
Researcher Affiliation Collaboration ¹Google Research, Brain Team; ²University of Toronto, Department of Computer Science.
Pseudocode Yes Algorithm 1: Sequence edit-distance utility u(g, s) ... Algorithm 2: Decision diagram for w⁽ᵏ⁾(b)
Open Source Code Yes We also release our implementation as an open-source library at https://github.com/google-research/r_u_sure.
Open Datasets No We use permissively licensed code from scientific computing repositories hosted on GitHub. We assemble a collection of 5000 held-out code files for each of the languages Java, JavaScript, C++, and Python.
Dataset Splits No No explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) were provided for the evaluation dataset.
Hardware Specification Yes On a GCP n1-standard-8 virtual machine, we obtain the following results:
Software Dependencies No Our current implementation of R-U-SURE runs on the CPU using Numba (Lam et al., 2015)
Experiment Setup Yes For this task, we configure the utility function with a per-character utility of 1 per matched SURE token and α = 0.7 per matched UNSURE, and a per-character cost of 1 per deleted SURE and β = 0.3 per deleted UNSURE; this setting is such that tokens with a lower-than-70% chance of being kept are optimal to mark as UNSURE. (We vary these thresholds for the Pareto plot, by setting the UNSURE match utility to α = c and the deletion cost to β = 1 − c for varying c.) We also include a localization penalty of 5 per edit inside SURE regions, a penalty of 0.25 in UNSURE regions, and a penalty of 0.75 for starting a new UNSURE.
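The 70% threshold in the setup above follows directly from the stated utilities: a token kept with probability p earns an expected p·1 − (1−p)·1 if marked SURE, versus p·α − (1−p)·β if marked UNSURE, and with α = 0.7, β = 0.3 the two are equal exactly at p = 0.7 (more generally, with α = c and β = 1 − c the crossover is at p = c). A minimal sketch of this trade-off, not the paper's implementation (which optimizes over full edit-distance utilities via decision diagrams), with function names chosen here for illustration:

```python
def expected_utility(p, alpha=0.7, beta=0.3):
    """Expected per-token utility of marking a token SURE vs. UNSURE,
    given probability p that the token survives into the final code."""
    sure = p * 1.0 - (1.0 - p) * 1.0       # +1 if matched, -1 if deleted
    unsure = p * alpha - (1.0 - p) * beta  # +alpha if matched, -beta if deleted
    return sure, unsure

def best_label(p, alpha=0.7, beta=0.3):
    """Pick the label with higher expected utility (ties go to SURE)."""
    sure, unsure = expected_utility(p, alpha, beta)
    return "SURE" if sure >= unsure else "UNSURE"

# Below the 70% survival probability, UNSURE is optimal; above it, SURE is.
print(best_label(0.6))  # UNSURE
print(best_label(0.8))  # SURE
```

With α = c and β = 1 − c (the Pareto-plot sweep), the same crossover calculation gives p = c, which is how varying c traces out different confidence thresholds.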