R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents

Authors: Daniel D. Johnson, Daniel Tarlow, Christian Walder

ICML 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our approach by applying it to three developer assistance tasks, each of which is visualized in Figure 1. For all tasks, we generate suggestion prototypes and hypothetical intents using a 5B-parameter decoder-only LM trained on 105B tokens of permissively-licensed open-source code from GitHub, and parse them into trees using an error-tolerant bracket-matching pseudo-parser (described in Appendix D). We compare our approach to a number of task-specific baselines, all of which build suggestions s ∈ S(g⁽¹⁾) from the same suggestion prototype, and evaluate how well each method can predict the changes necessary to obtain the final code state from the dataset, measured by our utility function as well as token accuracy.
Researcher Affiliation Collaboration ¹Google Research, Brain Team; ²University of Toronto, Department of Computer Science.
Pseudocode Yes Algorithm 1: Sequence edit-distance utility u(g, s) ... Algorithm 2: Decision diagram for w⁽ᵏ⁾(b)
Open Source Code Yes We also release our implementation as an open-source library at https://github.com/google-research/r_u_sure.
Open Datasets No We use permissively licensed code from scientific computing repositories hosted on GitHub. We assemble a collection of 5000 held-out code files for each of the languages Java, JavaScript, C++, and Python.
Dataset Splits No No explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) were provided for the evaluation dataset.
Hardware Specification Yes On a GCP n1-standard-8 virtual machine, we obtain the following results:
Software Dependencies No Our current implementation of R-U-SURE runs on the CPU using Numba (Lam et al., 2015)
Experiment Setup Yes For this task, we configure the utility function with a per-character utility of 1 per matched SURE token and α = 0.7 per matched UNSURE, and a per-character cost of 1 per deleted SURE and β = 0.3 per deleted UNSURE; this setting is such that tokens with a lower-than-70% chance of being kept are optimal to mark as UNSURE. (We vary these thresholds for the Pareto plot, by setting the UNSURE match utility to α = c and the deletion cost to β = 1 − c for varying c.) We also include a localization penalty of 5 per edit inside SURE regions, a penalty of 0.25 in UNSURE regions, and a penalty of 0.75 for starting a new UNSURE.
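The 70% threshold in the setup above follows directly from the stated utilities: a token kept with probability p earns an expected p·1 − (1−p)·1 if marked SURE, versus p·α − (1−p)·β if marked UNSURE, and with α = 0.7, β = 0.3 the two are equal exactly at p = 0.7 (more generally, with α = c and β = 1 − c the crossover is at p = c). A minimal sketch of this trade-off, not the paper's implementation (which optimizes over full edit-distance utilities via decision diagrams), with function names chosen here for illustration:

```python
def expected_utility(p, alpha=0.7, beta=0.3):
    """Expected per-token utility of marking a token SURE vs. UNSURE,
    given probability p that the token survives into the final code."""
    sure = p * 1.0 - (1.0 - p) * 1.0       # +1 if matched, -1 if deleted
    unsure = p * alpha - (1.0 - p) * beta  # +alpha if matched, -beta if deleted
    return sure, unsure

def best_label(p, alpha=0.7, beta=0.3):
    """Pick the label with higher expected utility (ties go to SURE)."""
    sure, unsure = expected_utility(p, alpha, beta)
    return "SURE" if sure >= unsure else "UNSURE"

# Below the 70% survival probability, UNSURE is optimal; above it, SURE is.
print(best_label(0.6))  # UNSURE
print(best_label(0.8))  # SURE
```

With α = c and β = 1 − c (the Pareto-plot sweep), the same crossover calculation gives p = c, which is how varying c traces out different confidence thresholds.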