Not All Tokens Are What You Need for Pretraining

Authors: Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When continually pretrained on the 15B OpenWebMath corpus, RHO-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, RHO-1-1B and RHO-1-7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
Researcher Affiliation | Collaboration | Zhenghao Lin (Xiamen University, Microsoft); Zhibin Gou (Tsinghua University, Microsoft); Yeyun Gong (Microsoft); Xiao Liu (Microsoft); Yelong Shen (Microsoft); Ruochen Xu (Microsoft); Chen Lin (Xiamen University, Shanghai AI Laboratory); Yujiu Yang (Tsinghua University); Jian Jiao (Microsoft); Nan Duan (Microsoft); Weizhu Chen (Microsoft)
Pseudocode | No | The paper includes a pipeline diagram (Figure 4) but no structured pseudocode or clearly labeled algorithm block; a hedged reconstruction of the token-selection step is sketched after the table.
Open Source Code | No | The justification for Question 5 (open access to data and code) in the NeurIPS Paper Checklist states: '[No] Justification: This may be temporary, and we are working hard to promote the process of open source.'
Open Datasets | Yes | For mathematical reasoning, we utilize the OpenWebMath (OWM) dataset [Paster et al., 2023], which comprises approximately 14B tokens sourced from math-related web pages in Common Crawl. In the general domain, we combine SlimPajama [Daria et al., 2023] and StarCoderData [Li et al., 2023a] (both part of the TinyLlama corpus) with OpenWebMath, training on a total of 80 billion tokens with a mix ratio of 6:3:1.
Dataset Splits | No | The paper mentions evaluating token-level loss on a 'validation set of approximately 320,000 tokens' (Section 2.1) for one analysis, but it does not provide explicit train/validation/test splits (e.g., percentages or exact counts) for the main pretraining corpora such as the 15B OpenWebMath corpus or the 80B general-domain tokens.
Hardware Specification | Yes | For the 1.1B model, we conducted our training on 32 H100 80G GPUs.
Software Dependencies | Yes | We use vLLM (v0.3.2) [Kwon et al., 2023] to speed up inference. (A minimal usage sketch follows the table.)
Experiment Setup | Yes | For math pretraining, we continue pretraining on the TinyLlama-1.1B model [Zhang et al., 2024] and the Mistral-7B model [Jiang et al., 2023] with learning rates of 8e-5 and 2e-5, respectively. ... The batch size is uniformly set to 1M tokens for both domains. Regarding the token selection ratio, we use 60% for the TinyLlama-1.1B model and 70% for the Mistral-7B model. (These hyperparameters are collected in the configuration sketch after the table.)
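
Because the paper itself offers no pseudocode, the block below is a minimal, hedged sketch of Selective-Language-Modeling-style token selection as described in the paper: score each token by the excess loss of the training model over a reference model and backpropagate only through the top-scoring fraction. This is our own reconstruction, not the authors' implementation; names such as train_model, ref_model, and keep_ratio are illustrative, and both models are assumed to expose a Hugging Face-style causal-LM interface.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(train_model, ref_model, input_ids, keep_ratio=0.6):
    """Hedged sketch of selective language modeling: keep only the tokens
    whose training-model loss most exceeds the reference-model loss."""
    labels = input_ids[:, 1:]                              # next-token targets
    logits = train_model(input_ids).logits[:, :-1]         # (batch, seq-1, vocab)
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits[:, :-1]

    def token_ce(lg):
        # Per-token cross-entropy, no reduction.
        return F.cross_entropy(
            lg.reshape(-1, lg.size(-1)), labels.reshape(-1), reduction="none"
        ).view(labels.shape)

    train_loss = token_ce(logits)          # (batch, seq-1)
    ref_loss = token_ce(ref_logits)

    # Excess loss as the selection score (detached so the mask carries no gradient).
    excess = (train_loss - ref_loss).detach()

    # Select the top keep_ratio fraction of tokens in the batch.
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = torch.topk(excess.flatten(), k).values.min()
    mask = (excess >= threshold).float()

    # Standard LM loss averaged over the selected tokens only.
    return (train_loss * mask).sum() / mask.sum().clamp(min=1.0)
```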
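
For the vLLM dependency noted in the Software Dependencies row, a minimal inference sketch is shown below. The model path is a placeholder and the sampling parameters are illustrative; they are not the paper's evaluation settings.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint path; the paper evaluates its own RHO-1 models.
llm = LLM(model="path/to/rho-1-checkpoint")
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ["Question: What is 17 * 24?\nAnswer:"]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```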
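
Finally, the quoted hyperparameters can be gathered into a single configuration sketch. The dictionary below is purely illustrative (the paper releases no config file), and values not reported in the quotes, such as the learning-rate schedule, are omitted.

```python
# Hypothetical summary of the hyperparameters quoted in the table above.
RHO1_PRETRAIN_CONFIG = {
    "tinyllama_1.1b": {
        "learning_rate": 8e-5,
        "batch_size_tokens": 1_000_000,  # "1M tokens" per batch
        "token_select_ratio": 0.60,      # train on the top 60% of tokens
        "hardware": "32x H100 80G GPUs",
    },
    "mistral_7b": {
        "learning_rate": 2e-5,
        "batch_size_tokens": 1_000_000,
        "token_select_ratio": 0.70,
    },
    # General-domain mix, in the order listed in the quote:
    # SlimPajama : StarCoderData : OpenWebMath = 6 : 3 : 1, 80B tokens total.
    "general_domain_mix": {"slimpajama": 6, "starcoderdata": 3, "openwebmath": 1},
}
```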