Can Large Language Models Reason about Program Invariants?

Authors: Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, Pengcheng Yin

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our proposed invariant generation methods, we perform a series of experiments on programs obtained from competitive programming contests (Section 4.1). Our primary baseline is Daikon, which generates the ground truth for training and evaluation by executing the programs on hundreds of possible inputs. ... We fine-tune LMs using an initial learning rate of 0.001, and a cosine learning rate decay schedule for 20,000 steps on 64 TPU v4 cores. The batch size is 128. ... We report Jaccard similarity, precision, recall, and F1 score at the level of invariants.
Researcher Affiliation | Collaboration | 1Columbia University, 2Google Research, Brain Team. Correspondence to: Kexin Pei <kpei@cs.columbia.edu>, David Bieber <dbieber@google.com>, Charles Sutton <charlessutton@google.com>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an unambiguous statement or a direct link to open-source code for the described methodology.
Open Datasets | Yes | We evaluate our models on the Java submissions in the Code Contests dataset (Li et al., 2022), which consists of millions of submissions to about four thousand distinct programming challenges; the dataset provides upwards of 200 inputs for each problem.
Dataset Splits | Yes | In total, the resulting Code Contests Java Invariants dataset includes 1,600,158 training, 86,346 validation, and 24,509 test examples.
Hardware Specification | Yes | We fine-tune LMs using an initial learning rate of 0.001, and a cosine learning rate decay schedule for 20,000 steps on 64 TPU v4 cores.
Software Dependencies | No | The paper mentions tools like 'Daikon' but does not provide specific version numbers for any software dependencies or libraries required for replication.
Experiment Setup | Yes | We fine-tune LMs using an initial learning rate of 0.001, and a cosine learning rate decay schedule for 20,000 steps on 64 TPU v4 cores. The batch size is 128.
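The Experiment Setup row quotes only the optimization hyperparameters: an initial learning rate of 0.001, cosine decay over 20,000 steps, and a batch size of 128. Below is a minimal sketch of that learning-rate curve; the optimizer, any warmup, and the final learning rate are not stated in the quote, so decaying to zero with no warmup is an assumption made purely for illustration.

```python
import math

# Hedged sketch of the quoted fine-tuning schedule: initial learning rate
# 0.001 decayed with a cosine schedule over 20,000 steps, batch size 128.
# The optimizer, warmup, and final learning rate are NOT stated in the
# quoted text; decaying to zero here is an assumption for illustration.

INIT_LR = 1e-3
DECAY_STEPS = 20_000
BATCH_SIZE = 128  # quoted in the Experiment Setup row

def cosine_lr(step: int, init_lr: float = INIT_LR,
              decay_steps: int = DECAY_STEPS, final_lr: float = 0.0) -> float:
    """Learning rate at `step` under cosine decay from init_lr to final_lr."""
    progress = min(step, decay_steps) / decay_steps
    return final_lr + 0.5 * (init_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# cosine_lr(0) == 0.001, cosine_lr(10_000) == 0.0005, cosine_lr(20_000) == 0.0
```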
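The Research Type row also quotes the evaluation metrics: Jaccard similarity, precision, recall, and F1 score at the level of invariants. The following is a minimal sketch of such set-level metrics, assuming the model's predictions and the Daikon-generated ground truth are both available as collections of invariant strings; the exact canonicalization and matching rules used in the paper are not given in the quote, so the whitespace stripping here is an illustrative stand-in.

```python
# Hedged sketch of invariant-level evaluation over two sets of invariant
# strings (predicted vs. Daikon-generated ground truth). The normalization
# step (strip whitespace) is an assumption, not the paper's actual matching.

def invariant_metrics(predicted, ground_truth):
    """Jaccard similarity, precision, recall, and F1 over sets of invariants."""
    pred = {inv.strip() for inv in predicted}
    gold = {inv.strip() for inv in ground_truth}
    tp = len(pred & gold)                      # correctly predicted invariants
    union = len(pred | gold)
    jaccard = tp / union if union else 1.0
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"jaccard": jaccard, "precision": precision, "recall": recall, "f1": f1}

# Example:
# invariant_metrics(["x > 0", "arr != null", "y >= 0"],
#                   ["x > 0", "y == x + 1", "arr != null"])
# -> jaccard 0.5, precision 2/3, recall 2/3, F1 2/3
```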