DocPrompting: Generating Code by Retrieving the Docs

Authors: Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset, tldr, DocPrompting improves CodeT5 and GPT-Neo-1.3B by up to an absolute 6.9% exact match. (Section 4: Experimental Setup.) A sketch of the standard pass@k estimator appears after this table.
Researcher Affiliation | Collaboration | Language Technologies Institute, Carnegie Mellon University; Inspired Cognition. {shuyanzh,ualon,fangzhex,zhiruow,zhengbaj,gneubig}@cs.cmu.edu
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Data and code are available at https://github.com/shuyanzhou/docprompting. (Footnote 1) ... To these ends, we make all our code, data, and models publicly available.
Open Datasets | Yes | Data and code are available at https://github.com/shuyanzhou/docprompting. (Footnote 1) ... We constructed the training, development and the test set with completely disjoint commands to test the generalizability of a code generation model. Statistics of the tldr shell-scripting benchmark: train 1315, dev 376, test 188 commands; NL-Bash pairs: train 6414, dev 1845, test 928, total 9187.
Dataset Splits | Yes | We constructed the training, development and the test set with completely disjoint commands to test the generalizability of a code generation model. Statistics of the tldr shell-scripting benchmark: train 1315, dev 376, test 188. ... This re-split results in 2,135/201/543 examples in the training/development/test sets, respectively. A sketch of a command-disjoint split appears after this table.
Hardware Specification | Yes | The training takes up to 15 hours on a single A6000 GPU. ... The training takes 8 hours on a single A6000 GPU.
Software Dependencies | No | The paper mentions using T5-base, CodeT5-base, GPT-Neo models, Codex, and Elasticsearch, but it does not specify version numbers for these software components or for other key libraries such as PyTorch or CUDA. An illustrative Elasticsearch retrieval sketch appears after this table.
Experiment Setup | Yes | We finetune the model for 10 epochs with a batch size of 512 and learning rate of 1e-5. ... We train our single-source generators for 20 epochs with learning rate 4e-5. We train our FiD-based generators for 10000 steps. The doc length is set to 200; any further content will be truncated. We follow Izacard and Grave (2021) to set the learning rate to 5e-5 with 2000 warmup steps and linear learning rate decay. The batch size is set to 8. (These hyperparameters are collected in the illustrative config sketch after this table.)
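
For context on the pass@1 / pass@10 numbers quoted under Research Type: execution-based evaluation is typically scored with the unbiased pass@k estimator of Chen et al. (2021). A minimal sketch, assuming n sampled programs per problem of which c pass the unit tests; the estimator itself is standard, but its use here is an assumption about the paper's evaluation script rather than a quote from it.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k programs drawn (without replacement) from n samples is correct,
    given that c of the n samples pass the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 10 samples per problem, 3 of which pass the tests.
print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=10))  # 1.0
```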
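The Dataset Splits row quotes the paper's command-disjoint construction for tldr. Below is a minimal sketch of what such a split could look like; the field name `command`, the helper name, and the split fractions are illustrative assumptions, not the paper's actual preprocessing code.

```python
import random

def split_by_command(examples, seed=0, train_frac=0.7, dev_frac=0.2):
    """Split NL-code pairs so that train/dev/test share no commands.

    `examples` is a list of dicts with a 'command' key (e.g. the Bash
    utility name in tldr); all pairs for a given command land in the
    same split, so test-time commands are unseen during training.
    """
    commands = sorted({ex["command"] for ex in examples})
    random.Random(seed).shuffle(commands)

    n_train = int(train_frac * len(commands))
    n_dev = int(dev_frac * len(commands))
    split_of = {}
    for i, cmd in enumerate(commands):
        if i < n_train:
            split_of[cmd] = "train"
        elif i < n_train + n_dev:
            split_of[cmd] = "dev"
        else:
            split_of[cmd] = "test"

    splits = {"train": [], "dev": [], "test": []}
    for ex in examples:
        splits[split_of[ex["command"]]].append(ex)
    return splits
```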
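The Software Dependencies row notes that Elasticsearch is used (presumably as the BM25 sparse retriever) without a version number. The following is a hedged sketch of indexing and querying documentation with the elasticsearch-py 8.x client; the host, the index name `docs`, the field `text`, and the example snippets are placeholders, and this is not the authors' retrieval code.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Index documentation snippets; Elasticsearch scores matches with BM25 by default.
docs = [
    {"doc_id": "tar_0", "text": "tar: -x extract files from an archive"},
    {"doc_id": "tar_1", "text": "tar: -z filter the archive through gzip"},
]
for d in docs:
    es.index(index="docs", id=d["doc_id"], document={"text": d["text"]})
es.indices.refresh(index="docs")

# Retrieve top-k docs for a natural-language intent and prepend them to the prompt.
intent = "extract a gzipped archive"
hits = es.search(index="docs", query={"match": {"text": intent}}, size=10)["hits"]["hits"]
retrieved = [h["_source"]["text"] for h in hits]
prompt = "\n".join(retrieved) + "\n# " + intent + "\n"
```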
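The Experiment Setup row lists the reported generator hyperparameters. A minimal sketch collecting them into Hugging Face `Seq2SeqTrainingArguments` objects; the use of this particular API and the output directory names are assumptions for illustration, while the numbers are the ones quoted above.

```python
from transformers import Seq2SeqTrainingArguments

# FiD-based generator, as quoted: 10000 steps, lr 5e-5 with 2000 warmup
# steps and linear decay, batch size 8; each retrieved doc is truncated
# to 200 tokens in preprocessing (not shown here).
fid_args = Seq2SeqTrainingArguments(
    output_dir="out/fid-generator",  # placeholder path
    max_steps=10_000,
    learning_rate=5e-5,
    warmup_steps=2_000,
    lr_scheduler_type="linear",
    per_device_train_batch_size=8,
)

# Single-source generators, as quoted: 20 epochs at lr 4e-5
# (batch size and schedule are not specified in the quoted text).
single_source_args = Seq2SeqTrainingArguments(
    output_dir="out/single-source-generator",  # placeholder path
    num_train_epochs=20,
    learning_rate=4e-5,
)
```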