SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking
Authors: Chris Cundy, Stefano Ermon
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that empirically, SequenceMatch training leads to improvements over MLE on text generation with language models and arithmetic. Finally, we evaluate the empirical performance of SequenceMatch-trained models, showing improved performance over the maximum likelihood objective in text generation and arithmetic. |
| Researcher Affiliation | Academia | Chris Cundy1 Stefano Ermon1 1Department of Computer Science, Stanford University {cundy, ermon}@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1: Training an autoregressive model against a Sequence Match objective |
| Open Source Code | No | The paper does not provide an explicit statement or direct link giving access to source code for its methodology. It mentions `openwebtext` as an open-sourced dataset, but that is a dataset rather than the authors' own implementation code. |
| Open Datasets | Yes | The dataset is the arithmetic add-or-sub-in-base sub-task of the math-dataset (Saxton et al., 2019) [...] Sequences are drawn from the openwebtext dataset (https://github.com/jcpeterson/openwebtext), an open-sourced dataset similar to the training set for GPT-2 (Radford et al., 2019) |
| Dataset Splits | Yes | The accuracy is computed over a held-out test set of 200 questions |
| Hardware Specification | Yes | We train each model on four A4000 GPUs with 16GB VRAM each. |
| Software Dependencies | No | The paper mentions software such as Llama 2-7B, QLoRA, PyTorch, and JAX but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | We use a QLoRA rank r of 64 for all experiments, and an α of 16. For the SequenceMatch models, we first train against the BC objective alone for k gradient steps, and then train against a convex combination of the BC and SM losses: L_total = β·L_BC + (1 − β)·L_SM, where β is annealed linearly from 1 to 0.2 over 2,000 gradient steps. For the arithmetic task k is 10,000; for the text-generation task k is 1,000. We use a learning-rate schedule consisting of a linear warmup over the first 2,000 steps, followed by cosine decay. For the text-generation evaluation, we set the prompt length to 256 and generate sequences of length 256. For generation, we set the temperature to 1 and top-p to 1, with no top-k sampling. |
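
As a reading aid for the Experiment Setup row, the sketch below shows one way the reported schedule could be implemented: a BC-only warm-start of k steps, the convex combination L_total = β·L_BC + (1 − β)·L_SM with β annealed linearly from 1 to 0.2 over 2,000 steps, and a learning-rate scale with linear warmup followed by cosine decay. This is a minimal sketch, not the authors' code; the total step budget and all function names are assumptions.

```python
import math
import torch

# Values taken from the reported setup where stated; TOTAL_STEPS is an assumption.
K_BC_ONLY = 10_000      # BC-only warm-start (10,000 for arithmetic, 1,000 for text generation)
ANNEAL_STEPS = 2_000    # beta annealed linearly from 1.0 to 0.2 over this many steps
WARMUP_STEPS = 2_000    # linear learning-rate warmup
TOTAL_STEPS = 50_000    # assumed overall step budget (not stated in the table)
BETA_FINAL = 0.2

def beta_at(step: int) -> float:
    """Weight on the BC loss: 1.0 during the warm-start, then a linear anneal to 0.2."""
    if step < K_BC_ONLY:
        return 1.0
    t = min((step - K_BC_ONLY) / ANNEAL_STEPS, 1.0)
    return 1.0 + t * (BETA_FINAL - 1.0)

def lr_scale(step: int) -> float:
    """Linear warmup over the first WARMUP_STEPS, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / max(TOTAL_STEPS - WARMUP_STEPS, 1), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def total_loss(loss_bc: torch.Tensor, loss_sm: torch.Tensor, step: int) -> torch.Tensor:
    """L_total = beta * L_BC + (1 - beta) * L_SM, per the reported convex combination."""
    beta = beta_at(step)
    return beta * loss_bc + (1.0 - beta) * loss_sm
```

In practice `lr_scale` would be wrapped in a `torch.optim.lr_scheduler.LambdaLR`, and `loss_bc` / `loss_sm` would come from the model's behavioural-cloning (MLE) and SequenceMatch objectives respectively.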
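
For the generation settings in the same row (temperature 1, top-p 1, no top-k), the snippet below shows how they map onto a standard Hugging Face `generate` call. The checkpoint and prompt are placeholders, not the authors' setup; with these settings the call reduces to plain ancestral sampling.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper fine-tunes Llama 2-7B with QLoRA, but any causal LM illustrates the settings.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("A 256-token prompt drawn from openwebtext ...", return_tensors="pt").input_ids

generated = model.generate(
    prompt_ids,
    max_new_tokens=256,   # generate sequences of length 256
    do_sample=True,
    temperature=1.0,      # temperature 1
    top_p=1.0,            # top-p 1 (no nucleus truncation)
    top_k=0,              # disable top-k filtering
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```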