Fewer Truncations Improve Language Modeling

Authors: Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results from both text and code pretraining show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.
Researcher Affiliation | Industry | AWS AI Labs. Correspondence to: Hantian Ding <dhantian@amazon.com>, Zijian Wang <zijwan@amazon.com>.
Pseudocode | Yes | Algorithm 1: First/Best-Fit-Decreasing (see the packing sketch after this table)
Open Source Code | No | The paper does not state that source code for its method is released, and no repository link is provided.
Open Datasets | Yes | We use two popular pre-training datasets in our study: the Falcon RefinedWeb dataset (Penedo et al., 2023) for text, and the Stack (Kocetkov et al., 2022) for code.
Dataset Splits | Yes | We report perplexity on validation set (PPL), and average performance in reading comprehension (RDC), natural language inference (NLI), context following (CTX), summarization (SUM, in ROUGE-2), and commonsense (CMS).
Hardware Specification | Yes | All models were trained on a cluster of 256 A100 GPUs.
Software Dependencies | Yes | We use FlashAttention-2 (Dao, 2023) to accelerate training.
Experiment Setup | Yes | We use a learning rate of 3e-4 with a cosine learning rate scheduler, and warm up over the first 3,000 steps. The global batch size is 2M tokens.
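
The Pseudocode row above refers to the paper's Algorithm 1 (First/Best-Fit-Decreasing), which packs document chunks into fixed-length training sequences without splitting any chunk across sequences. The sketch below is a textbook Best-Fit-Decreasing bin packing routine for illustration only, not the authors' implementation: it assumes chunks have already been cut to at most the sequence capacity, ignores the efficiency optimizations needed at pre-training scale, and the function name and example lengths are our own.

```python
def best_fit_decreasing(chunk_lengths, capacity):
    """Pack chunk lengths into bins of size `capacity` via Best-Fit-Decreasing.

    Each bin corresponds to one training sequence; every chunk is assumed
    to be at most `capacity` tokens long.
    """
    bins = []   # bins[i] = list of chunk lengths forming training sequence i
    free = []   # free[i] = remaining token budget in bins[i]
    for length in sorted(chunk_lengths, reverse=True):
        # Best fit: the bin with the least remaining space that still fits the chunk.
        best = min(
            (i for i in range(len(bins)) if free[i] >= length),
            key=lambda i: free[i],
            default=None,
        )
        if best is None:
            bins.append([length])            # open a new training sequence
            free.append(capacity - length)
        else:
            bins[best].append(length)
            free[best] -= length
    return bins


# Example: pack chunk lengths into 2048-token sequences.
sequences = best_fit_decreasing([1800, 1200, 900, 700, 400, 300], capacity=2048)
# -> [[1800], [1200, 700], [900, 400, 300]]; no chunk is split across sequences.
```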
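
The Experiment Setup row quotes a 3e-4 peak learning rate, a cosine scheduler, and a 3,000-step warmup. The following is a minimal sketch of such a schedule, assuming a linear warmup and decay toward zero; the warmup shape, final learning rate, and total step count are not given in the quoted text and are assumptions here.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=3000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr (assumed 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # assumed linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```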