Pragmatic Code Autocomplete
Authors: Gabriel Poesia, Noah Goodman
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our system first proposes a set of strings that can be abbreviated by the user. Using only 100 abbreviations, we observe that a corpus of Python code can be compressed by 15%, a number that can be improved even further by specializing the abbreviations to a particular code base. We then use a contextualized sequence-to-sequence model to rank potential expansions of inputs that include abbreviations. In an offline reconstruction task our model achieves accuracies ranging from 93% to 99%, depending on the programming language and user settings. The model is small enough to run on a commodity CPU in real-time. We evaluate the usability of our system in a user study, integrating it in Microsoft VSCode, a popular code text editor. We observe that our system performs well and is complementary to traditional autocomplete features. |
| Researcher Affiliation | Academia | Gabriel Poesia, Noah Goodman Department of Computer Science, Stanford University, Stanford, CA 94305 {poesia,ndgoodman}@stanford.edu |
| Pseudocode | Yes | Function FindAbbreviatableSet(D, n): F[·] ← 0; foreach l ∈ D do foreach t ∈ tokenize(l) do F[t] ← F[t] + 1 end end; T ← F.keys().sort(λt. (\|t\| − 1) · F[t]); return T[1...n]. Algorithm 1: Finds a list of n tokens that maximize compression on a dataset D of lines of code, when those tokens are abbreviated to a single character. Function CapCollisions(T, c): T′ ← []; for i ← 1 to \|T\| do if T′.count(λt. t₁ = T[i]₁) < c then T′.append(T[i]) end; return T′. Algorithm 2: Removes elements from a list of tokens T so that the number of potential expansions for any abbreviation is limited by c. This allows users to smoothly trade off compression for higher accuracy. (A minimal Python sketch of both procedures appears after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/gpoesia/magicomplete |
| Open Datasets | Yes | To train and evaluate our models, we collected a dataset of code from open source repositories in Python, Java and JavaScript, the 3 programming languages that were ranked as the most popular in the Stack Overflow Developer Survey 2020. We used the Github API to collect a list of 7 million public repositories. |
| Dataset Splits | Yes | We split these files into training/validation/test sets using a standard 80%/10%/10% split. |
| Hardware Specification | Yes | We ran all experiments on a server with an NVIDIA TITAN Xp GPU with 12GB of memory. |
| Software Dependencies | Yes | We implemented our algorithms and models using the PyTorch 1.6.0 library (Paszke et al. 2019). |
| Experiment Setup | Yes | Architecture details and parameters can be found in the Appendix. The neural model we described in Section 4 uses an LSTM with 512 hidden units for both the bidirectional encoder and the decoder, and 50-dimensional character embeddings. We embed context into a 128-dimensional vector. All models were trained for 10 epochs, using the Adam optimizer with a learning rate of 0.0005 and β = (0.9, 0.999), and a batch size of 512. The learning rate was set by doing a simple search over 10^(-i) and 5·10^(-i) for i ∈ {1, 2, 3, 4, 5}, using validation accuracy to pick the value used in later experiments. (A hedged PyTorch sketch of this setup appears below the table.) |
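To make the pseudocode row concrete, here is a minimal Python sketch of Algorithms 1 and 2. It assumes whitespace splitting as a stand-in for the paper's tokenizer; the function names `find_abbreviatable_set` and `cap_collisions` and the toy corpus are illustrative, not the authors' code.

```python
from collections import Counter
from typing import Callable, List

def find_abbreviatable_set(lines: List[str], n: int,
                           tokenize: Callable[[str], List[str]] = str.split) -> List[str]:
    """Algorithm 1: pick the n tokens that maximize compression when each
    is abbreviated to a single character (saving len(t) - 1 characters per use)."""
    freq = Counter()
    for line in lines:
        for token in tokenize(line):
            freq[token] += 1
    # Rank tokens by total characters saved, largest first.
    ranked = sorted(freq, key=lambda t: (len(t) - 1) * freq[t], reverse=True)
    return ranked[:n]

def cap_collisions(tokens: List[str], c: int) -> List[str]:
    """Algorithm 2: keep at most c tokens per first character, so any
    single-character abbreviation has at most c candidate expansions."""
    kept: List[str] = []
    for token in tokens:
        if sum(1 for t in kept if t[0] == token[0]) < c:
            kept.append(token)
    return kept

# Hypothetical usage on a toy corpus; the paper reports that n = 100
# abbreviations compress a Python corpus by roughly 15%.
corpus = ["import numpy as np", "def forward(self, input):", "return self.linear(input)"]
abbreviations = cap_collisions(find_abbreviatable_set(corpus, n=100), c=3)
```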
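The experiment-setup row lists model dimensions and optimizer settings; the sketch below assembles them into a runnable PyTorch skeleton. It assumes a character-level bidirectional LSTM encoder, an LSTM decoder conditioned at every step on a 128-dimensional context embedding, and a categorical context vocabulary; the class name `AbbreviationExpander` and the way context is injected are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AbbreviationExpander(nn.Module):
    """Illustrative seq2seq skeleton matching the reported sizes: 50-dim character
    embeddings, 512 hidden units for the bidirectional encoder and the decoder,
    and a 128-dim context embedding."""
    def __init__(self, vocab_size: int, n_contexts: int):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, 50)
        self.context_embedding = nn.Embedding(n_contexts, 128)
        self.encoder = nn.LSTM(50, 512, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(50 + 128, 512, batch_first=True)
        self.output = nn.Linear(512, vocab_size)

    def forward(self, src_chars, tgt_chars, context_ids):
        # Encode the (possibly abbreviated) input line character by character.
        _, (h, c) = self.encoder(self.char_embedding(src_chars))
        # Merge the two encoder directions into the decoder's initial state.
        h0 = (h[0] + h[1]).unsqueeze(0)
        c0 = (c[0] + c[1]).unsqueeze(0)
        # Condition every decoder step on the 128-dim context embedding.
        ctx = self.context_embedding(context_ids).unsqueeze(1)
        ctx = ctx.expand(-1, tgt_chars.size(1), -1)
        dec_in = torch.cat([self.char_embedding(tgt_chars), ctx], dim=-1)
        dec_out, _ = self.decoder(dec_in, (h0, c0))
        return self.output(dec_out)  # per-step logits over next characters

model = AbbreviationExpander(vocab_size=128, n_contexts=1000)
# Optimizer settings as reported: Adam, lr = 5e-4, betas = (0.9, 0.999);
# training used 10 epochs with batch size 512.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
```

Keeping the recurrent layers this small is what allows real-time inference on a commodity CPU, as the paper notes.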