SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking

Authors: Chris Cundy, Stefano Ermon

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that empirically, SequenceMatch training leads to improvements over MLE on text generation with language models and arithmetic. Finally, we evaluate the empirical performance of SequenceMatch-trained models, showing improved performance over the maximum-likelihood objective in text generation and arithmetic.
Researcher Affiliation | Academia | Chris Cundy, Stefano Ermon; Department of Computer Science, Stanford University; {cundy, ermon}@cs.stanford.edu
Pseudocode | Yes | Algorithm 1: Training an autoregressive model against a SequenceMatch objective (see the training-loop sketch below the table).
Open Source Code | No | The paper does not contain an explicit statement or direct link providing access to the source code for its methodology. It mentions `openwebtext` as an open-sourced dataset, but this is not the authors' own implementation code.
Open Datasets | Yes | The dataset is the arithmetic add-or-sub-in-base sub-task of the math-dataset (Saxton et al., 2018) [...] Sequences are drawn from the openwebtext dataset (https://github.com/jcpeterson/openwebtext), an open-sourced dataset similar to the training set for GPT-2 (Radford et al., 2018).
Dataset Splits | Yes | The accuracy is computed over a held-out test set of 200 questions.
Hardware Specification | Yes | We train each model on four A4000 GPUs with 16GB VRAM each.
Software Dependencies | No | The paper mentions software like Llama2-7b, QLoRA, PyTorch, and JAX but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | We use a QLoRA r of 64 for all experiments, and an α of 16. For the SequenceMatch models, we first train against the BC objective alone for k gradient steps, and then train against a convex combination of the SM loss: L_total = β·L_BC + (1 - β)·L_SM, where β is annealed from 1 to 0.2 linearly over 2,000 gradient steps. For the arithmetic task, k is 10,000; for the text-generation task, k is 1,000. We use a learning-rate scheme consisting of a linear warmup from 0 to 2,000 steps, followed by cosine decay. For the text-generation evaluation, we set the prompt length at 256 and then generate sequences of length 256. For generation, we set the temperature to 1 and top-p sampling to 1, with no top-k sampling. (See the schedule and sampling sketches below.)
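
Based only on the staged training described in the Pseudocode and Experiment Setup rows, the following is a minimal PyTorch-style sketch, not the authors' implementation: BC-only training for k gradient steps, followed by the convex combination L_total = β·L_BC + (1 - β)·L_SM with β annealed linearly from 1 to 0.2 over 2,000 steps. The names `bc_loss`, `sm_loss`, `model`, `loader`, and `total_steps` are hypothetical placeholders; the actual SequenceMatch loss and data pipeline are defined in the paper, not here.

import torch


def beta_schedule(step: int, bc_only_steps: int,
                  anneal_steps: int = 2_000, beta_min: float = 0.2) -> float:
    """Beta stays at 1 during the BC-only phase, then anneals linearly to beta_min."""
    if step < bc_only_steps:
        return 1.0
    frac = min((step - bc_only_steps) / anneal_steps, 1.0)
    return 1.0 + frac * (beta_min - 1.0)


def train(model, loader, optimizer, bc_loss, sm_loss,
          bc_only_steps: int = 10_000, total_steps: int = 20_000):
    """bc_loss / sm_loss are placeholder callables returning scalar losses.

    bc_only_steps is k (10,000 for arithmetic, 1,000 for text generation in the
    quoted setup); total_steps is an assumed placeholder, not given in the paper.
    """
    step = 0
    while step < total_steps:
        for batch in loader:
            beta = beta_schedule(step, bc_only_steps)
            loss = beta * bc_loss(model, batch)
            if beta < 1.0:
                # Convex combination with the SequenceMatch objective.
                loss = loss + (1.0 - beta) * sm_loss(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= total_steps:
                break
    return model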
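
The Experiment Setup row also describes a learning-rate schedule (linear warmup over the first 2,000 steps, then cosine decay) and the sampling configuration for the text-generation evaluation. The sketch below is one way to realise those settings; `total_steps` and the Hugging Face-style `generate` call are assumptions, not details given in the paper.

import math
import torch


def warmup_cosine_scheduler(optimizer, warmup_steps: int = 2_000,
                            total_steps: int = 20_000):
    """Linear warmup from 0 over `warmup_steps`, then cosine decay.

    `total_steps` is an assumed placeholder; the quoted setup does not give it.
    """
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)


def generate_continuations(model, prompt_ids):
    """Sampling settings from the quoted setup, assuming a Hugging Face-style
    `generate` interface (an assumption; the paper does not name the API)."""
    return model.generate(
        input_ids=prompt_ids[:, :256],  # prompt length 256
        max_new_tokens=256,             # generate sequences of length 256
        do_sample=True,
        temperature=1.0,                # temperature 1
        top_p=1.0,                      # top-p sampling set to 1
        top_k=0,                        # no top-k sampling
    )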