LEVER: Learning to Verify Language-to-Code Generation with Execution
Authors: Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-Tau Yih, Sida Wang, Xi Victoria Lin
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them. |
| Researcher Affiliation | Collaboration | 1Yale University 2Meta AI. Correspondence to: Ansong Ni <ansong.ni@yale.edu>, Xi Victoria Lin <victorialin@meta.com>, Sida I. Wang <sida@meta.com>. |
| Pseudocode | No | The paper describes its approach conceptually and mathematically with equations, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures (see the hedged sketch after this table). |
| Open Source Code | Yes | We open-source our experiment code for reproducibility: https://github.com/niansong1996/lever. |
| Open Datasets | Yes | We conduct experiments on four language-to-code datasets across domains of semantic parsing, table QA, math reasoning and basic python programming. The main settings of these four datasets are shown in Table 1. |
| Dataset Splits | Yes | Table 1 (Data Statistics) reports the split sizes per dataset, e.g. for Spider: # Train 7,000, # Dev 1,032, # Test 4,336; analogous split sizes are listed for the other three datasets. |
| Hardware Specification | No | The paper mentions "GPU memory" (Section 3.4) but does not provide specific details such as GPU models, CPU models, or any other explicit hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using T5, RoBERTa models, Codex API, and Python bindings, but it does not provide specific version numbers for any software dependencies, libraries, or programming languages. |
| Experiment Setup | Yes | We set the temperature as T = 0.6 for Codex and T = 0.8 for InCoder and CodeGen, as the optimal temperatures for the best pass@k by referring to the original papers (Fried et al., 2022; Nijkamp et al., 2022). [...] Detailed batch sizes and downsampling factor can be found in Table 7 in the Appendix. |
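Since the paper presents its method only in prose and equations, the following is a minimal, hedged sketch of the execution-then-verify reranking that LEVER performs, not the authors' released implementation (see their repository for that). The callables `sample_fn` (code LLM sampling at a temperature such as the T = 0.6 quoted above), `execute_fn` (sandboxed program execution), and `verify_fn` (the learned verifier's correctness probability), as well as the default `k`, are hypothetical names introduced here for illustration.

```python
# Hedged sketch of LEVER-style verification and reranking; sample_fn, execute_fn,
# and verify_fn are hypothetical stand-ins for the code LLM, a sandboxed executor,
# and the learned verifier.
import math
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def rerank_with_verifier(
    question: str,
    sample_fn: Callable[[str, int, float], List[Tuple[str, float]]],
    execute_fn: Callable[[str], str],
    verify_fn: Callable[[str, str, str], float],
    k: int = 20,             # number of sampled candidates (illustrative default)
    temperature: float = 0.6,
) -> Tuple[str, str]:
    """Sample k candidate programs, execute each one, and return the program
    whose execution result has the highest aggregated generation * verification
    probability."""
    # Each candidate is (program, log p_LM(program | question)).
    candidates = sample_fn(question, k, temperature)

    scores: Dict[str, float] = defaultdict(float)  # execution result -> aggregated score
    witness: Dict[str, str] = {}                   # execution result -> one program yielding it
    for program, logprob in candidates:
        result = execute_fn(program)                    # run the program in a sandbox
        p_gen = math.exp(logprob)                       # LM probability of the program
        p_ok = verify_fn(question, program, result)     # verifier's P(correct | x, y, exec(y))
        scores[result] += p_gen * p_ok
        witness.setdefault(result, program)

    best_result = max(scores, key=scores.get)
    return witness[best_result], best_result
```

Aggregating scores over candidates that produce the same execution result, as the paper describes, lets probability mass accumulate on semantically equivalent programs before the final answer is chosen.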