SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking

Authors: Chris Cundy, Stefano Ermon

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that empirically, SequenceMatch training leads to improvements over MLE on text generation with language models and arithmetic. Finally, we evaluate the empirical performance of SequenceMatch-trained models, showing improved performance over the maximum-likelihood objective in text generation and arithmetic.
Researcher Affiliation | Academia | Chris Cundy, Stefano Ermon; Department of Computer Science, Stanford University; {cundy, ermon}@cs.stanford.edu
Pseudocode | Yes | Algorithm 1: Training an autoregressive model against a SequenceMatch objective (see the training-loop sketch below the table).
Open Source Code | No | The paper does not contain an explicit statement or direct link providing access to the source code for its methodology. It mentions `openwebtext` as an open-sourced dataset, but this is not the authors' own implementation code.
Open Datasets | Yes | The dataset is the arithmetic add-or-sub-in-base sub-task of the math-dataset (Saxton et al., 2018) [...] Sequences are drawn from the openwebtext dataset (https://github.com/jcpeterson/openwebtext), an open-sourced dataset similar to the training set for GPT-2 (Radford et al., 2018).
Dataset Splits | Yes | The accuracy is computed over a held-out test set of 200 questions.
Hardware Specification | Yes | We train each model on four A4000 GPUs with 16GB VRAM each.
Software Dependencies | No | The paper mentions software like Llama2-7b, QLoRA, PyTorch, and JAX but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | We use a QLoRA r of 64 for all experiments, and an α of 16. For the SequenceMatch models, we first train against the BC objective alone for k gradient steps, and then train against a convex combination of the SM loss: L_total = β·L_BC + (1 - β)·L_SM, where β is annealed from 1 to 0.2 linearly over 2,000 gradient steps. For the arithmetic task, k is 10,000; for the text-generation task, k is 1,000. We use a learning-rate scheme consisting of a linear warmup from 0 to 2,000 steps, followed by cosine decay. For the text-generation evaluation, we set the prompt length at 256 and then generate sequences of length 256. For generation, we set the temperature to 1 and top-p sampling to 1, with no top-k sampling. (See the schedule and sampling sketches below.)
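
Based only on the staged training described in the Pseudocode and Experiment Setup rows, the following is a minimal PyTorch-style sketch, not the authors' implementation: BC-only training for k gradient steps, followed by the convex combination L_total = β·L_BC + (1 - β)·L_SM with β annealed linearly from 1 to 0.2 over 2,000 steps. The names `bc_loss`, `sm_loss`, `model`, `loader`, and `total_steps` are hypothetical placeholders; the actual SequenceMatch loss and data pipeline are defined in the paper, not here.

import torch


def beta_schedule(step: int, bc_only_steps: int,
                  anneal_steps: int = 2_000, beta_min: float = 0.2) -> float:
    """Beta stays at 1 during the BC-only phase, then anneals linearly to beta_min."""
    if step < bc_only_steps:
        return 1.0
    frac = min((step - bc_only_steps) / anneal_steps, 1.0)
    return 1.0 + frac * (beta_min - 1.0)


def train(model, loader, optimizer, bc_loss, sm_loss,
          bc_only_steps: int = 10_000, total_steps: int = 20_000):
    """bc_loss / sm_loss are placeholder callables returning scalar losses.

    bc_only_steps is k (10,000 for arithmetic, 1,000 for text generation in the
    quoted setup); total_steps is an assumed placeholder, not given in the paper.
    """
    step = 0
    while step < total_steps:
        for batch in loader:
            beta = beta_schedule(step, bc_only_steps)
            loss = beta * bc_loss(model, batch)
            if beta < 1.0:
                # Convex combination with the SequenceMatch objective.
                loss = loss + (1.0 - beta) * sm_loss(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= total_steps:
                break
    return model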
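
The Experiment Setup row also describes a learning-rate schedule (linear warmup over the first 2,000 steps, then cosine decay) and the sampling configuration for the text-generation evaluation. The sketch below is one way to realise those settings; `total_steps` and the Hugging Face-style `generate` call are assumptions, not details given in the paper.

import math
import torch


def warmup_cosine_scheduler(optimizer, warmup_steps: int = 2_000,
                            total_steps: int = 20_000):
    """Linear warmup from 0 over `warmup_steps`, then cosine decay.

    `total_steps` is an assumed placeholder; the quoted setup does not give it.
    """
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)


def generate_continuations(model, prompt_ids):
    """Sampling settings from the quoted setup, assuming a Hugging Face-style
    `generate` interface (an assumption; the paper does not name the API)."""
    return model.generate(
        input_ids=prompt_ids[:, :256],  # prompt length 256
        max_new_tokens=256,             # generate sequences of length 256
        do_sample=True,
        temperature=1.0,                # temperature 1
        top_p=1.0,                      # top-p sampling set to 1
        top_k=0,                        # no top-k sampling
    )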