Not All Tokens Are What You Need for Pretraining
Authors: Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When continual pretraining on 15B OpenWebMath corpus, RHO-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks. After fine-tuning, RHO-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. |
| Researcher Affiliation | Collaboration | Zhenghao Lin (Xiamen University, Microsoft), Zhibin Gou (Tsinghua University, Microsoft), Yeyun Gong (Microsoft), Xiao Liu (Microsoft), Yelong Shen (Microsoft), Ruochen Xu (Microsoft), Chen Lin (Xiamen University, Shanghai AI Laboratory), Yujiu Yang (Tsinghua University), Jian Jiao (Microsoft), Nan Duan (Microsoft), Weizhu Chen (Microsoft) |
| Pseudocode | No | The paper includes a pipeline diagram (Figure 4) but no structured pseudocode or clearly labeled algorithm block. |
| Open Source Code | No | Justification for Question 5 (Open access to data and code) in the NeurIPS Paper Checklist states: '[No] Justification: This may be temporary, and we are working hard to promote the process of open source.' |
| Open Datasets | Yes | For mathematical reasoning, we utilize the OpenWebMath (OWM) dataset [Paster et al., 2023], which comprises approximately 14B tokens sourced from math-related web pages in the Common Crawl. In the general domain, we combine the SlimPajama [Daria et al., 2023] and StarCoderData [Li et al., 2023a] (both part of the TinyLlama corpus) with OpenWebMath, training on a total of 80 billion tokens with a mix ratio of 6:3:1. (A token-budget sketch for this mix appears after the table.) |
| Dataset Splits | No | The paper mentions evaluating token-level loss using a 'validation set of approximately 320,000 tokens' in Section 2.1 for a specific experiment. However, it does not provide explicit train/validation/test splits (e.g., percentages or exact counts) for the main pretraining corpora, such as the 15B OpenWebMath corpus or the 80B general-domain mix. |
| Hardware Specification | Yes | For the 1.1B model, we conducted our training on 32 H100 80G GPUs. |
| Software Dependencies | Yes | We use vLLM (v0.3.2) [Kwon et al., 2023] to speed up inference. (A minimal vLLM usage sketch appears after the table.) |
| Experiment Setup | Yes | For math pretraining, we continue pretraining on the TinyLlama-1.1B model [Zhang et al., 2024] and the Mistral-7B model [Jiang et al., 2023] with learning rates of 8e-5 and 2e-5, respectively. ... The batch size is uniformly set to 1M tokens for both domains. Regarding the token selection ratio, we use 60% for the TinyLlama-1.1B model and 70% for the Mistral-7B model. (These values are collected into a config sketch after the table.) |
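The 6:3:1 mix over 80 billion general-domain tokens reported in the Open Datasets row can be turned into per-corpus token budgets. Below is a minimal sketch, assuming the ratio maps to SlimPajama : StarCoderData : OpenWebMath in the order the corpora are listed in the paper; the helper function and names are illustrative, not taken from the authors' (unreleased) code.

```python
# Hypothetical helper: split the reported 80B-token budget by the 6:3:1 mix ratio.
# The corpus-to-weight mapping is an assumption based on the order of listing in the paper.
TOTAL_TOKENS = 80_000_000_000
MIX_WEIGHTS = {"SlimPajama": 6, "StarCoderData": 3, "OpenWebMath": 1}

def token_budget(total_tokens: int, weights: dict) -> dict:
    """Allocate total_tokens proportionally to integer mix weights."""
    denom = sum(weights.values())
    return {name: total_tokens * w // denom for name, w in weights.items()}

for corpus, tokens in token_budget(TOTAL_TOKENS, MIX_WEIGHTS).items():
    print(f"{corpus}: {tokens / 1e9:.0f}B tokens")
# SlimPajama: 48B, StarCoderData: 24B, OpenWebMath: 8B
```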
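The inference dependency noted in the Software Dependencies row (vLLM v0.3.2) can be exercised with a few lines. This is a minimal sketch of the vLLM offline-inference API as of v0.3.x, not the authors' evaluation harness; the model path and prompt are placeholders.

```python
# Minimal vLLM offline-inference sketch; model path and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1")  # swap in the checkpoint under evaluation
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding for few-shot evals

prompts = ["Question: What is 12 * 7?\nAnswer:"]
for request_output in llm.generate(prompts, params):
    print(request_output.outputs[0].text)
```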
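The hyperparameters quoted in the Experiment Setup row can be collected in one place for a reproduction attempt. This is a sketch of a plausible config object using only the values reported in the paper; the dataclass and its field names are illustrative, since the authors' code is not released.

```python
# Hypothetical config mirroring the math-pretraining hyperparameters reported in the paper.
from dataclasses import dataclass

@dataclass
class MathPretrainConfig:
    base_model: str          # checkpoint that continual pretraining starts from
    learning_rate: float     # peak learning rate reported in the paper
    batch_size_tokens: int   # 1M tokens per batch, shared across both domains
    token_select_ratio: float  # fraction of tokens selected for training (token selection ratio)

CONFIGS = {
    "rho-1-1b": MathPretrainConfig("TinyLlama-1.1B", 8e-5, 1_000_000, 0.60),
    "rho-1-7b": MathPretrainConfig("Mistral-7B", 2e-5, 1_000_000, 0.70),
}
```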