Fewer Truncations Improve Language Modeling
Authors: Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results from both text and code pretraining show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%. |
| Researcher Affiliation | Industry | 1AWS AI Labs. Correspondence to: Hantian Ding <dhantian@amazon.com>, Zijian Wang <zijwan@amazon.com>. |
| Pseudocode | Yes | Algorithm 1: First/Best-Fit-Decreasing (see the packing sketch below the table) |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is made available or provide a link to a repository. |
| Open Datasets | Yes | We use two popular pre-training datasets in our study: the Falcon Refined Web dataset (Penedo et al., 2023) for text, and the Stack (Kocetkov et al., 2022) for code. |
| Dataset Splits | Yes | We report perplexity on validation set (PPL), and average performance in reading comprehension (RDC), natural language inference (NLI), context following (CTX), summarization (SUM, in ROUGE-2), and commonsense (CMS). |
| Hardware Specification | Yes | All models were trained on a cluster of 256 A100 GPUs. |
| Software Dependencies | Yes | We use FlashAttention-2 (Dao, 2023) to accelerate training. |
| Experiment Setup | Yes | We use a learning rate of 3e-4 with a cosine learning rate scheduler, and warm up over the first 3,000 steps. The global batch size is 2M tokens. (See the schedule sketch below the table.) |
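The Pseudocode row points to Algorithm 1 (First/Best-Fit-Decreasing), the bin-packing heuristic behind the paper's truncation-free document packing. Below is a minimal quadratic-time Python sketch of the Best-Fit-Decreasing variant over document token counts; the function name `best_fit_decreasing`, the toy lengths, and the 2048-token capacity are illustrative assumptions, not the authors' implementation (which packs at much larger scale and may use faster data structures).

```python
from typing import List


def best_fit_decreasing(doc_lengths: List[int], max_len: int) -> List[List[int]]:
    """Pack documents into training sequences of capacity `max_len` tokens
    with the Best-Fit-Decreasing heuristic: sort items longest-first, place
    each into the fullest bin that still has room, and open a new bin only
    when none fits. Assumes every document is already <= max_len (longer
    documents would be split into max_len-sized chunks beforehand).

    Returns a list of bins, each a list of document indices.
    """
    # Process document indices from longest to shortest.
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)

    bins: List[List[int]] = []   # document indices per training sequence
    remaining: List[int] = []    # spare token capacity of each bin

    for i in order:
        length = doc_lengths[i]
        # Best fit: the bin with the smallest remaining capacity that still fits.
        best = None
        for b, cap in enumerate(remaining):
            if cap >= length and (best is None or cap < remaining[best]):
                best = b
        if best is None:         # no existing bin fits, so open a new one
            bins.append([i])
            remaining.append(max_len - length)
        else:
            bins[best].append(i)
            remaining[best] -= length
    return bins


# Toy usage: pack documents of various token counts into 2048-token sequences.
if __name__ == "__main__":
    lengths = [1500, 900, 600, 512, 300, 2048, 100]
    packed = best_fit_decreasing(lengths, max_len=2048)
    print(packed)  # [[5], [0, 3], [1, 2, 4, 6]] -- no document is truncated
```

The Experiment Setup row reports a peak learning rate of 3e-4 with cosine decay and a 3,000-step warmup. A hedged PyTorch sketch of such a schedule follows; the total step count, the minimum learning-rate ratio, and the helper name `cosine_with_warmup` are assumptions for illustration and are not stated in the paper.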
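```python
import math
import torch


def cosine_with_warmup(optimizer, warmup_steps=3000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup over `warmup_steps`, then cosine decay toward
    `min_ratio` * peak LR over the remaining steps. Only the peak LR (3e-4),
    the cosine shape, and the 3,000-step warmup come from the paper."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return min_ratio + (1.0 - min_ratio) * cosine
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)


# Usage with the reported peak learning rate of 3e-4.
model = torch.nn.Linear(8, 8)                      # stand-in for the language model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = cosine_with_warmup(opt)
for _ in range(10):                                 # one scheduler step per optimizer step
    opt.step()
    sched.step()
```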