Can Large Language Models Reason about Program Invariants?

Authors: Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, Pengcheng Yin

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our proposed invariant generation methods, we perform a series of experiments on programs obtained from competitive programming contests (Section 4.1). Our primary baseline is Daikon, which generates the ground truth for training and evaluation by executing the programs on hundreds of possible inputs. ... We fine-tune LMs using an initial learning rate of 0.001, and a cosine learning rate decay schedule for 20,000 steps on 64 TPU v4 cores. The batch size is 128. ... We report Jaccard similarity, precision, recall, and F1 score at the level of invariants.
Researcher Affiliation | Collaboration | 1Columbia University, 2Google Research, Brain Team. Correspondence to: Kexin Pei <kpei@cs.columbia.edu>, David Bieber <dbieber@google.com>, Charles Sutton <charlessutton@google.com>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an unambiguous statement or a direct link to open-source code for the described methodology.
Open Datasets | Yes | We evaluate our models on the Java submissions in the Code Contests dataset (Li et al., 2022), which consists of millions of submissions to about four thousand distinct programming challenges; the dataset provides upwards of 200 inputs for each problem.
Dataset Splits | Yes | In total, the resulting Code Contests Java Invariants dataset includes 1,600,158 training, 86,346 validation, and 24,509 test examples.
Hardware Specification | Yes | We fine-tune LMs using an initial learning rate of 0.001, and a cosine learning rate decay schedule for 20,000 steps on 64 TPU v4 cores.
Software Dependencies | No | The paper mentions tools like 'Daikon' but does not provide specific version numbers for any software dependencies or libraries required for replication.
Experiment Setup | Yes | We fine-tune LMs using an initial learning rate of 0.001, and a cosine learning rate decay schedule for 20,000 steps on 64 TPU v4 cores. The batch size is 128.
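The Experiment Setup row quotes only the optimization hyperparameters: an initial learning rate of 0.001, cosine decay over 20,000 steps, and a batch size of 128. Below is a minimal sketch of that learning-rate curve; the optimizer, any warmup, and the final learning rate are not stated in the quote, so decaying to zero with no warmup is an assumption made purely for illustration.

```python
import math

# Hedged sketch of the quoted fine-tuning schedule: initial learning rate
# 0.001 decayed with a cosine schedule over 20,000 steps, batch size 128.
# The optimizer, warmup, and final learning rate are NOT stated in the
# quoted text; decaying to zero here is an assumption for illustration.

INIT_LR = 1e-3
DECAY_STEPS = 20_000
BATCH_SIZE = 128  # quoted in the Experiment Setup row

def cosine_lr(step: int, init_lr: float = INIT_LR,
              decay_steps: int = DECAY_STEPS, final_lr: float = 0.0) -> float:
    """Learning rate at `step` under cosine decay from init_lr to final_lr."""
    progress = min(step, decay_steps) / decay_steps
    return final_lr + 0.5 * (init_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# cosine_lr(0) == 0.001, cosine_lr(10_000) == 0.0005, cosine_lr(20_000) == 0.0
```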
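The Research Type row also quotes the evaluation metrics: Jaccard similarity, precision, recall, and F1 score at the level of invariants. The following is a minimal sketch of such set-level metrics, assuming the model's predictions and the Daikon-generated ground truth are both available as collections of invariant strings; the exact canonicalization and matching rules used in the paper are not given in the quote, so the whitespace stripping here is an illustrative stand-in.

```python
# Hedged sketch of invariant-level evaluation over two sets of invariant
# strings (predicted vs. Daikon-generated ground truth). The normalization
# step (strip whitespace) is an assumption, not the paper's actual matching.

def invariant_metrics(predicted, ground_truth):
    """Jaccard similarity, precision, recall, and F1 over sets of invariants."""
    pred = {inv.strip() for inv in predicted}
    gold = {inv.strip() for inv in ground_truth}
    tp = len(pred & gold)                      # correctly predicted invariants
    union = len(pred | gold)
    jaccard = tp / union if union else 1.0
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"jaccard": jaccard, "precision": precision, "recall": recall, "f1": f1}

# Example:
# invariant_metrics(["x > 0", "arr != null", "y >= 0"],
#                   ["x > 0", "y == x + 1", "arr != null"])
# -> jaccard 0.5, precision 2/3, recall 2/3, F1 2/3
```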