LEVER: Learning to Verify Language-to-Code Generation with Execution

Authors: Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-Tau Yih, Sida Wang, Xi Victoria Lin

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
Researcher Affiliation | Collaboration | 1 Yale University, 2 Meta AI. Correspondence to: Ansong Ni <ansong.ni@yale.edu>, Xi Victoria Lin <victorialin@meta.com>, Sida I. Wang <sida@meta.com>.
Pseudocode | No | The paper describes its approach conceptually and mathematically with equations, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code | Yes | We open-source our experiment code for reproducibility: https://github.com/niansong1996/lever.
Open Datasets | Yes | We conduct experiments on four language-to-code datasets across domains of semantic parsing, table QA, math reasoning and basic Python programming. The main settings of these four datasets are shown in Table 1.
Dataset Splits | Yes | Table 1: Data Statistics — # Train: 7,000, # Dev: 1,032, # Test: 4,336 (figures shown for Spider; Table 1 reports the corresponding splits for the other datasets).
Hardware Specification | No | The paper mentions "GPU memory" (Section 3.4) but does not provide specific details such as GPU models, CPU models, or any other explicit hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions using the T5 and RoBERTa models, the Codex API, and Python bindings, but it does not provide specific version numbers for any software dependencies, libraries, or programming languages.
Experiment Setup | Yes | We set the temperature as T = 0.6 for Codex and T = 0.8 for InCoder and CodeGen, as the optimal temperatures for the best pass@k by referring to the original papers (Fried et al., 2022; Nijkamp et al., 2022). [...] Detailed batch sizes and downsampling factor can be found in Table 7 in the Appendix.
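
Since the paper itself contains no pseudocode (see the Pseudocode row above), the following is a minimal sketch of the execution-guided reranking the paper describes: sample candidate programs from the code LLM, execute each, score it by the LM probability times a learned verifier's probability of correctness, and aggregate scores over candidates that produce the same execution result. The helper names (lm_logprob, verifier_prob, execute) are hypothetical placeholders for illustration, not the API of the released repository.

```python
import math
from collections import defaultdict

def lever_rerank(question, candidates, lm_logprob, verifier_prob, execute):
    """Sketch of LEVER-style reranking: each candidate program is scored by
    p_LM(program | question) * p_verifier(correct | question, program, result),
    and scores are aggregated over candidates sharing the same execution result.

    Hypothetical callables (assumed signatures, not from the paper's code):
      lm_logprob(question, program)            -> log p_LM(program | question)
      verifier_prob(question, program, result) -> probability the program is correct
      execute(program)                         -> execution result (hashable)
    """
    agg_scores = defaultdict(float)  # total score per distinct execution result
    best_in_group = {}               # highest-scoring (score, program) per result

    for program in candidates:
        result = execute(program)    # run the candidate to obtain its result
        score = math.exp(lm_logprob(question, program)) * verifier_prob(question, program, result)
        agg_scores[result] += score
        if result not in best_in_group or score > best_in_group[result][0]:
            best_in_group[result] = (score, program)

    # return a representative program of the highest-scoring execution result
    top_result = max(agg_scores, key=agg_scores.get)
    return best_in_group[top_result][1], top_result
```

Aggregating over execution results rather than individual programs is the design choice the paper emphasizes: syntactically different programs that compute the same result pool their probability mass, so the final answer is chosen by the semantics of the programs rather than any single sample.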
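
The experiment-setup row tunes the sampling temperature for the best pass@k. For reference, this is the standard unbiased pass@k estimator from Chen et al. (2021), which that metric conventionally refers to; the function name and NumPy dependency here are illustrative choices, not something specified by the LEVER paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples drawn without replacement from n generated programs,
    of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct program
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```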