DocPrompting: Generating Code by Retrieving the Docs
Authors: Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset tldr, DocPrompting improves CodeT5 and GPT-Neo-1.3B by up to absolute 6.9% exact match. (A pass@k estimator is sketched below the table.) |
| Researcher Affiliation | Collaboration | Language Technologies Institute, Carnegie Mellon University, Inspired Cognition {shuyanzh,ualon,fangzhex,zhiruow,zhengbaj,gneubig}@cs.cmu.edu |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Data and code are available at https://github.com/shuyanzhou/docprompting. (Footnote 1) ... To these ends, we make all our code, data, and models publicly available. |
| Open Datasets | Yes | Data and code are available at https://github.com/shuyanzhou/docprompting. (Footnote 1) ... We constructed the training, development and the test set with completely disjoint commands to test the generalizability of a code generation model. Statistics of the tldr shell scripting benchmark: commands — train 1,315, dev 376, test 188; NL-Bash pairs — train 6,414, dev 1,845, test 928 (total 9,187). |
| Dataset Splits | Yes | We constructed the training, development and the test set with completely disjoint commands to test the generalizability of a code generation model. The tldr benchmark contains 1,315/376/188 commands in the training/development/test sets. ... This re-split results in 2,135/201/543 examples in the training/development/test sets, respectively. (A disjointness check is sketched below the table.) |
| Hardware Specification | Yes | The training takes up to 15 hours on a single A6000 GPU. ... The training takes 8 hours on a single A6000 GPU. |
| Software Dependencies | No | The paper mentions using T5-base, CodeT5-base, GPT-Neo models, Codex, and Elasticsearch, but it does not specify version numbers for these software components or any other key libraries such as PyTorch or CUDA. |
| Experiment Setup | Yes | We finetune the model for 10 epochs with batch size of 512 and learning rate of 1e-5. ... We train our single-source generators for 20 epochs with learning rate 4e-5. We train our FiD-based generators for 10,000 steps. The doc length is set to 200; any further content will be truncated. We follow (Izacard and Grave, 2021) to set learning rate to 5e-5 with 2,000 steps warmup and linear learning rate decay. The batch size is set to 8. (These hyperparameters are collected into a config sketch below the table.) |
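
The pass@1 and pass@10 numbers quoted in the Research Type row come from execution-based evaluation on CoNaLa. The quoted text does not restate the estimator, but pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021); a minimal sketch under that assumption:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k).

    n: number of generated samples per problem
    c: number of those samples that pass the unit tests
    k: evaluation budget (e.g. 1 or 10)
    """
    if n - c < k:
        return 1.0
    # Computed as a running product for numerical stability.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Sanity check of the reported relative gain: a +2.85% absolute pass@1 gain
# described as 52% relative implies a baseline pass@1 of about 2.85 / 0.52 ≈ 5.5%.
print(pass_at_k(n=200, c=11, k=1))  # 11/200 passing samples -> 0.055
```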
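
The Dataset Splits row states that tldr's train/dev/test splits contain completely disjoint commands. A hypothetical check of that property (the `assert_disjoint` helper and the toy example are illustrative, not taken from the authors' code):

```python
# Hypothetical verification that command-level splits share no commands,
# as claimed for the tldr benchmark (1,315 / 376 / 188 unique commands).
def assert_disjoint(splits: dict[str, set[str]]) -> None:
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = splits[a] & splits[b]
            assert not overlap, f"{a}/{b} share commands: {sorted(overlap)[:5]}"

# Toy example with made-up command names.
assert_disjoint({"train": {"tar", "grep"}, "dev": {"awk"}, "test": {"sed"}})
```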
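
For convenience, the hyperparameters quoted in the Experiment Setup row can be grouped per component. The names below are illustrative (the quoted text does not identify the model being finetuned in the first group); the values are exactly those reported:

```python
# Hypothetical grouping of the reported hyperparameters; variable names are
# illustrative and not taken from the authors' repository.
finetune_config = dict(epochs=10, batch_size=512, learning_rate=1e-5)

single_source_generator = dict(epochs=20, learning_rate=4e-5)

fid_generator = dict(            # FiD-based generator (Izacard & Grave, 2021)
    train_steps=10_000,
    max_doc_length=200,          # documents truncated beyond 200 tokens
    learning_rate=5e-5,
    warmup_steps=2_000,
    lr_schedule="linear_decay",
    batch_size=8,
)
```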