Think before you speak: Training Language Models With Pause Tokens

Authors: Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question answering, general understanding and fact recall. Our main finding is that inference-time delays show gains on our tasks when the model is both pretrained and finetuned with delays. For the 1B model, we witness gains on eight tasks, most prominently a gain of 18% EM score on the QA task of SQuAD, 8% on CommonSenseQA and 1% accuracy on the reasoning task of GSM8k.
Researcher Affiliation | Collaboration | Sachin Goyal, Machine Learning Department, Carnegie Mellon University (sachingo@andrew.cmu.edu); Ziwei Ji, Google Research, NY (ziweiji@google.com); Ankit Singh Rawat, Google Research, NY (ankitsrawat@google.com); Aditya Krishna Menon, Google Research, NY (adityakmenon@google.com); Sanjiv Kumar, Google Research, NY (sanjivk@google.com); Vaishnavh Nagarajan, Google Research, NY (vaishnavh@google.com). Work done in part as a Student Researcher at Google.
Pseudocode | Yes | Algorithm 1: Pause-pretraining; Algorithm 2: Pause-finetuning (Stage 2: Finetuning with pause); Algorithm 3: Pause-inference (Stage 3: Inference with pause). (A minimal pause-inference sketch follows this table.)
Open Source Code | No | The paper does not contain an explicit statement or link providing access to the source code for the methodology described in the paper.
Open Datasets | Yes | Both the standard and pause models are pretrained on the C4 English mixture (Raffel et al., 2020), using the causal next-token prediction objective for a total of 200B tokens (slightly more than 1 epoch on C4). We consider nine varied downstream tasks: (a) reasoning (GSM8k (Cobbe et al., 2021)), (b) extractive question answering (SQuAD (Rajpurkar et al., 2016), CoQA (Reddy et al., 2019)), (c) general understanding (CommonSenseQA (Talmor et al., 2019), PhysicalIQA (Bisk et al., 2020)), (d) long-term context recall (LAMBADA (Paperno et al., 2016)), (e) natural language inference (HellaSwag (Zellers et al., 2019)), and (f) fact recall (WebQuestions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019)). (A dataset-loading sketch follows this table.)
Dataset Splits | No | The paper mentions 'For all the downstream finetuning experiments, we report mean and standard deviation over 5 runs (with the randomness purely from the finetuning stage)' and 'We tune the learning rate and batch size', which implies a validation process. However, it does not explicitly provide specific train/validation/test dataset split percentages, absolute sample counts for each split, or references to predefined validation splits with citations.
Hardware Specification | No | The paper mentions using 'decoder-only models of size 1B and 130M' but does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing environments with specifications.
Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments.
Experiment Setup | Yes | We tune the learning rate and batch size for standard end-to-end training, and use the best hyperparameters for all other training variants as well. We share all the hyperparameters in Appendix H. Table 3: Downstream finetuning hyperparameters for the 1B model. Table 4: Downstream finetuning hyperparameters for the 130M model. (A hedged sketch of the tuning loop follows this table.)
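
For readers who want to reproduce the pipeline despite the missing source code, below is a minimal sketch of the pause-inference step (Algorithm 3): K copies of a learnable <pause> token are appended after the prompt, the model's outputs at those positions are ignored, and answer extraction begins only after the last pause token has been processed. The use of GPT-2 as a stand-in checkpoint, the Hugging Face special-token API, and K = 10 are illustrative assumptions, not the authors' setup (their 1B/130M checkpoints are not released).

```python
# Minimal sketch of pause-inference (Algorithm 3), assuming a Hugging Face causal LM.
# The <pause> token, K = 10 appended pauses, and GPT-2 as a stand-in model are
# illustrative assumptions; the paper trains its own 1B/130M decoder-only models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper's checkpoints are not released
NUM_PAUSES = 10      # number of <pause> tokens appended at inference (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Register a dedicated <pause> token and resize the embedding matrix so the model
# has an embedding for it. In pause-pretraining / pause-finetuning this embedding
# would be learned; here it is randomly initialised and therefore untrained.
tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
model.resize_token_embeddings(len(tokenizer))
pause_id = tokenizer.convert_tokens_to_ids("<pause>")

def pause_generate(prompt: str, max_new_tokens: int = 32) -> str:
    # Append NUM_PAUSES pause tokens after the prompt; the model's outputs at
    # these positions are ignored, and decoding of the answer starts only after
    # the last <pause> token has been processed.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    pauses = torch.full((1, NUM_PAUSES), pause_id, dtype=torch.long)
    padded = torch.cat([input_ids, pauses], dim=1)
    with torch.no_grad():
        out = model.generate(padded, max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
    # Decode only the tokens generated after the prompt + pause prefix.
    return tokenizer.decode(out[0, padded.shape[1]:], skip_special_tokens=True)

print(pause_generate("Q: What is 7 + 5?\nA:"))
```

With an untrained <pause> embedding this only illustrates the mechanics; the paper's gains require the pause token to be present during both pretraining and finetuning.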
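All of the corpora in the Open Datasets row are public. The following sketch shows one way they could be pulled with the Hugging Face datasets library; the hub identifiers and configs ("allenai/c4"/"en", "gsm8k"/"main", "squad", "commonsense_qa") are assumptions about current hosting rather than paths given in the paper.

```python
# Sketch of loading the public corpora named in the paper via Hugging Face `datasets`.
# Hub identifiers and configs are assumptions about current hosting; the paper itself
# only cites the original dataset releases.
from datasets import load_dataset

# Pretraining: C4 English mixture, streamed because the full corpus is very large.
c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

# A few of the nine downstream tasks (identifiers assumed):
gsm8k = load_dataset("gsm8k", "main")              # reasoning
squad = load_dataset("squad")                      # extractive question answering
commonsense_qa = load_dataset("commonsense_qa")    # general understanding

print(next(iter(c4_stream))["text"][:200])
print(gsm8k["train"][0]["question"])
```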
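The Experiment Setup row only specifies the tuning protocol (tune learning rate and batch size on standard end-to-end finetuning, then reuse the best setting for the pause variants); the concrete values live in the paper's Appendix H (Tables 3 and 4) and are not reproduced here. A hedged sketch of that loop, with placeholder grids and a hypothetical finetune_and_eval helper, is:

```python
# Hedged sketch of the tuning protocol described in the Experiment Setup row:
# learning rate and batch size are tuned on the standard (no-pause) variant, and
# the best configuration is reused for every pause-training variant. The grid
# values and finetune_and_eval() are hypothetical placeholders, not Appendix H values.
import itertools

LEARNING_RATES = [1e-4, 5e-5, 1e-5]   # placeholder grid (assumed)
BATCH_SIZES = [32, 64, 128]           # placeholder grid (assumed)

def finetune_and_eval(variant: str, lr: float, batch_size: int) -> float:
    # Stand-in: a real implementation would finetune `variant` on the downstream
    # task with (lr, batch_size) and return a validation metric such as EM.
    return 0.0

# Step 1: pick the best (lr, batch_size) on the standard end-to-end variant.
best_lr, best_bs = max(
    itertools.product(LEARNING_RATES, BATCH_SIZES),
    key=lambda cfg: finetune_and_eval("standard", *cfg),
)

# Step 2: reuse that configuration for the pause-training variants.
for variant in ("pause_pretrain_pause_finetune", "standard_pretrain_pause_finetune"):
    print(variant, (best_lr, best_bs), finetune_and_eval(variant, best_lr, best_bs))
```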