Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fewer Truncations Improve Language Modeling
Authors: Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results from both text and code pretraining show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%. |
| Researcher Affiliation | Industry | AWS AI Labs. Correspondence to: Hantian Ding <EMAIL>, Zijian Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 First/Best-Fit-Decreasing |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is made available or provide a link to a repository. |
| Open Datasets | Yes | We use two popular pre-training datasets in our study: the Falcon Refined Web dataset (Penedo et al., 2023) for text, and the Stack (Kocetkov et al., 2022) for code. |
| Dataset Splits | Yes | We report perplexity on validation set (PPL), and average performance in reading comprehension (RDC), natural language inference (NLI), context following (CTX), summarization (SUM, in ROUGE-2), and commonsense (CMS). |
| Hardware Specification | Yes | All models were trained on a cluster of 256 A100 GPUs. |
| Software Dependencies | Yes | We use Flash Attention2 (Dao, 2023) to accelerate training. |
| Experiment Setup | Yes | We use a learning rate of 3e-4 with a cosine learning rate scheduler, and warm up over the first 3,000 steps. The global batch size is 2M tokens. |
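
The Pseudocode row above names Algorithm 1 (First/Best-Fit-Decreasing) without reproducing it. As a rough illustration of the general best-fit-decreasing bin-packing heuristic, and not the authors' implementation, the sketch below packs item lengths into fixed-capacity bins. The function name `best_fit_decreasing` and its parameters are hypothetical, and it assumes every item already fits within one bin.

```python
def best_fit_decreasing(lengths, max_len):
    """Pack items into bins of capacity max_len using best-fit-decreasing.

    Illustrative sketch only: names and interface are assumptions,
    not taken from the paper. Returns a list of bins, each a list of
    indices into `lengths`.
    """
    bins = []       # bins[b] holds the item indices placed in bin b
    remaining = []  # remaining[b] is the free capacity left in bin b

    # Visit items from longest to shortest (the "decreasing" part).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    for i in order:
        size = lengths[i]
        if size > max_len:
            raise ValueError(f"item {i} (length {size}) exceeds capacity {max_len}")

        # Best fit: among bins with enough room, pick the one with the
        # least free capacity. This linear scan keeps the sketch simple.
        best = None
        for b, cap in enumerate(remaining):
            if cap >= size and (best is None or cap < remaining[best]):
                best = b

        if best is None:
            # No existing bin can hold the item, so open a new one.
            bins.append([i])
            remaining.append(max_len - size)
        else:
            bins[best].append(i)
            remaining[best] -= size

    return bins


if __name__ == "__main__":
    print(best_fit_decreasing([5, 4, 3, 2, 2], max_len=8))
```

The linear scan over bins makes this quadratic in the worst case; a sorted structure over remaining capacities would be the usual way to speed it up.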
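
The Experiment Setup row quotes a peak learning rate of 3e-4, a cosine scheduler, and a 3,000-step warmup. A minimal sketch of such a schedule follows. The linear warmup shape, the `total_steps` value, and the `min_lr` floor are assumptions made for illustration, since the quoted text does not state them, and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, peak_lr=3e-4, warmup_steps=3000,
               total_steps=100_000, min_lr=0.0):
    """Learning rate at a given 0-indexed optimizer step.

    peak_lr and warmup_steps come from the quoted setup. total_steps,
    min_lr, and the linear warmup shape are assumed, not sourced.
    """
    if step < warmup_steps:
        # Linear warmup (assumed shape) reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps

    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1500, 2999, 3000, 50_000, 100_000):
        print(s, lr_at_step(s))
```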