Fewer Truncations Improve Language Modeling

Authors: Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results from both text and code pretraining show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.
Researcher Affiliation | Industry | AWS AI Labs. Correspondence to: Hantian Ding <dhantian@amazon.com>, Zijian Wang <zijwan@amazon.com>.
Pseudocode | Yes | Algorithm 1: First/Best-Fit-Decreasing (see the packing sketch after this table)
Open Source Code | No | The paper does not state that source code for its method is released, and no repository link is provided.
Open Datasets | Yes | We use two popular pre-training datasets in our study: the Falcon RefinedWeb dataset (Penedo et al., 2023) for text, and the Stack (Kocetkov et al., 2022) for code.
Dataset Splits | Yes | We report perplexity on validation set (PPL), and average performance in reading comprehension (RDC), natural language inference (NLI), context following (CTX), summarization (SUM, in ROUGE-2), and commonsense (CMS).
Hardware Specification | Yes | All models were trained on a cluster of 256 A100 GPUs.
Software Dependencies | Yes | We use FlashAttention-2 (Dao, 2023) to accelerate training.
Experiment Setup | Yes | We use a learning rate of 3e-4 with a cosine learning rate scheduler, and warm up over the first 3,000 steps. The global batch size is 2M tokens.
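
The Pseudocode row above refers to the paper's Algorithm 1 (First/Best-Fit-Decreasing), which packs document chunks into fixed-length training sequences without splitting any chunk across sequences. The sketch below is a textbook Best-Fit-Decreasing bin packing routine for illustration only, not the authors' implementation: it assumes chunks have already been cut to at most the sequence capacity, ignores the efficiency optimizations needed at pre-training scale, and the function name and example lengths are our own.

```python
def best_fit_decreasing(chunk_lengths, capacity):
    """Pack chunk lengths into bins of size `capacity` via Best-Fit-Decreasing.

    Each bin corresponds to one training sequence; every chunk is assumed
    to be at most `capacity` tokens long.
    """
    bins = []   # bins[i] = list of chunk lengths forming training sequence i
    free = []   # free[i] = remaining token budget in bins[i]
    for length in sorted(chunk_lengths, reverse=True):
        # Best fit: the bin with the least remaining space that still fits the chunk.
        best = min(
            (i for i in range(len(bins)) if free[i] >= length),
            key=lambda i: free[i],
            default=None,
        )
        if best is None:
            bins.append([length])            # open a new training sequence
            free.append(capacity - length)
        else:
            bins[best].append(length)
            free[best] -= length
    return bins


# Example: pack chunk lengths into 2048-token sequences.
sequences = best_fit_decreasing([1800, 1200, 900, 700, 400, 300], capacity=2048)
# -> [[1800], [1200, 700], [900, 400, 300]]; no chunk is split across sequences.
```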
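
The Experiment Setup row quotes a 3e-4 peak learning rate, a cosine scheduler, and a 3,000-step warmup. The following is a minimal sketch of such a schedule, assuming a linear warmup and decay toward zero; the warmup shape, final learning rate, and total step count are not given in the quoted text and are assumptions here.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=3000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr (assumed 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # assumed linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```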