Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking
Authors: Chris Cundy, Stefano Ermon
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that empirically, Sequence Match training leads to improvements over MLE on text generation with language models and arithmetic. Finally we evaluate the empirical performance of Sequence Match-trained models, showing improved performance over the maximum likelihood objective in text generation and arithmetic. |
| Researcher Affiliation | Academia | Chris Cundy¹, Stefano Ermon¹; ¹Department of Computer Science, Stanford University; EMAIL |
| Pseudocode | Yes | Algorithm 1: Training an autoregressive model against a Sequence Match objective |
| Open Source Code | No | The paper does not contain an explicit statement or direct link providing access to the source code for the methodology described in the paper. It mentions `openwebtext` as an open-sourced dataset but this is not the authors' own implementation code. |
| Open Datasets | Yes | The dataset is the arithmetic add-or-sub-in-base sub-task of the math-dataset (Saxton et al., 2018) [...] Sequences are drawn from the openwebtext dataset (footnote 3: https://github.com/jcpeterson/openwebtext), an open-sourced dataset similar to the training set for GPT-2 (Radford et al., 2018). |
| Dataset Splits | Yes | The accuracy is computed over a held-out test set of 200 questions |
| Hardware Specification | Yes | We train each model on four A4000 GPUs with 16GB VRAM each. |
| Software Dependencies | No | The paper mentions software like Llama2-7b, QLORA, PyTorch, and JAX but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | We use a QLORA r of 64 for all experiments, and an α of 16. For the Sequence Match models, we first train against the BC objective alone for k gradient steps, and then train against a convex combination of the SM loss: $L_{\text{total}} = \beta L_{\text{BC}} + (1 - \beta) L_{\text{SM}}$, where β is annealed from 1 to 0.2 linearly over 2,000 gradient steps. For the arithmetic task k is 10,000. For the text generation task k is 1,000. We use a learning rate scheme consisting of a linear warmup from 0 to 2000 steps, followed by cosine decay. For the text-generation evaluation, we set the prompt length at 256. We then generate sequences of length 256. For the generation, we set the temperature to 1 and the top-p sampling to 1, with no top-k sampling. |
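
The schedules quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reading of the quoted text, not the authors' code: the function names are hypothetical, and `peak_lr` and `total_steps` are placeholders because the quote does not state them. The sketch also assumes that the β annealing starts immediately after the k BC-only steps and that the cosine decay ends at zero.

```python
import math


def beta_schedule(step, k, anneal_steps=2000, beta_start=1.0, beta_end=0.2):
    """Weight on the BC loss at a given gradient step.

    Assumption: beta stays at beta_start for the first k (BC-only) steps,
    then falls linearly to beta_end over anneal_steps and stays there.
    """
    if step < k:
        return beta_start
    frac = min((step - k) / anneal_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)


def total_loss(loss_bc, loss_sm, beta):
    """Convex combination L_total = beta * L_BC + (1 - beta) * L_SM."""
    return beta * loss_bc + (1.0 - beta) * loss_sm


def lr_schedule(step, peak_lr, total_steps, warmup_steps=2000):
    """Linear warmup from 0 to peak_lr, then cosine decay.

    Assumption: the decay reaches 0 at total_steps; peak_lr and
    total_steps are placeholders, not values from the paper.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    span = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / span, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With k = 1,000 (the text generation setting), `beta_schedule` returns 1.0 through step 1,000 and settles at 0.2 from step 3,000 onward. With k = 10,000 (the arithmetic setting), the same ramp runs from step 10,000 to step 12,000.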