LEVER: Learning to Verify Language-to-Code Generation with Execution

Authors: Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-Tau Yih, Sida Wang, Xi Victoria Lin

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
Researcher Affiliation | Collaboration | 1 Yale University, 2 Meta AI. Correspondence to: Ansong Ni <ansong.ni@yale.edu>, Xi Victoria Lin <victorialin@meta.com>, Sida I. Wang <sida@meta.com>.
Pseudocode | No | The paper describes its approach conceptually and mathematically with equations, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code | Yes | We open-source our experiment code for reproducibility: https://github.com/niansong1996/lever.
Open Datasets | Yes | We conduct experiments on four language-to-code datasets across domains of semantic parsing, table QA, math reasoning and basic Python programming. The main settings of these four datasets are shown in Table 1.
Dataset Splits | Yes | Table 1: Data Statistics — # Train: 7,000, # Dev: 1,032, # Test: 4,336 (figures shown for Spider; Table 1 reports the corresponding splits for the other datasets).
Hardware Specification | No | The paper mentions "GPU memory" (Section 3.4) but does not provide specific details such as GPU models, CPU models, or any other explicit hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions using the T5 and RoBERTa models, the Codex API, and Python bindings, but it does not provide specific version numbers for any software dependencies, libraries, or programming languages.
Experiment Setup | Yes | We set the temperature as T = 0.6 for Codex and T = 0.8 for InCoder and CodeGen, as the optimal temperatures for the best pass@k by referring to the original papers (Fried et al., 2022; Nijkamp et al., 2022). [...] Detailed batch sizes and downsampling factor can be found in Table 7 in the Appendix.
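
Since the paper itself contains no pseudocode (see the Pseudocode row above), the following is a minimal sketch of the execution-guided reranking the paper describes: sample candidate programs from the code LLM, execute each, score it by the LM probability times a learned verifier's probability of correctness, and aggregate scores over candidates that produce the same execution result. The helper names (lm_logprob, verifier_prob, execute) are hypothetical placeholders for illustration, not the API of the released repository.

```python
import math
from collections import defaultdict

def lever_rerank(question, candidates, lm_logprob, verifier_prob, execute):
    """Sketch of LEVER-style reranking: each candidate program is scored by
    p_LM(program | question) * p_verifier(correct | question, program, result),
    and scores are aggregated over candidates sharing the same execution result.

    Hypothetical callables (assumed signatures, not from the paper's code):
      lm_logprob(question, program)            -> log p_LM(program | question)
      verifier_prob(question, program, result) -> probability the program is correct
      execute(program)                         -> execution result (hashable)
    """
    agg_scores = defaultdict(float)  # total score per distinct execution result
    best_in_group = {}               # highest-scoring (score, program) per result

    for program in candidates:
        result = execute(program)    # run the candidate to obtain its result
        score = math.exp(lm_logprob(question, program)) * verifier_prob(question, program, result)
        agg_scores[result] += score
        if result not in best_in_group or score > best_in_group[result][0]:
            best_in_group[result] = (score, program)

    # return a representative program of the highest-scoring execution result
    top_result = max(agg_scores, key=agg_scores.get)
    return best_in_group[top_result][1], top_result
```

Aggregating over execution results rather than individual programs is the design choice the paper emphasizes: syntactically different programs that compute the same result pool their probability mass, so the final answer is chosen by the semantics of the programs rather than any single sample.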
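
The experiment-setup row tunes the sampling temperature for the best pass@k. For reference, this is the standard unbiased pass@k estimator from Chen et al. (2021), which that metric conventionally refers to; the function name and NumPy dependency here are illustrative choices, not something specified by the LEVER paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples drawn without replacement from n generated programs,
    of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct program
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```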