Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Parallel Scaling Law for Language Models
Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P) while showing superior inference efficiency. For example, PARSCALE can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. We then carry out large-scale pre-training experiments on the Stack-V2 [58] and Pile [26] corpus, by ranging P from 1 to 8 and model parameters from 500M to 4.4B. We use the results to fit a new parallel scaling law that generalizes the Chinchilla scaling law, as depicted in Figure 1(2). It shows that parallelizing into P streams equates to scaling the model parameters by O(log P). Results on comprehensive tasks corroborate this conclusion. Tables 2 and 3 illustrate the average performance on downstream tasks (coding tasks for Stack-V2-Python and general tasks for Pile) after pre-training, with comprehensive results in Appendix G. |
| Researcher Affiliation | Collaboration | Mouxiang Chen¹,², Binyuan Hui², Zeyu Cui², Jiaxi Yang², Dayiheng Liu², Jianling Sun¹, Junyang Lin², Zhongxin Liu¹. ¹Zhejiang University, ²Qwen Team, Alibaba Group |
| Pseudocode | No | The paper describes methods and equations like Equation (1) and (2) but does not present a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Our code and 67 trained model checkpoints are publicly available at https://github.com/QwenLM/ParScale and https://huggingface.co/ParScale. |
| Open Datasets | Yes | We primarily focus on the relationship between parallel scaling and parameter scaling. Therefore, we fix the training data size at 42 billion tokens without data repeat. We introduce the results for more training tokens in the next section, and leave the impact of data scale on the scaling law for future work. Most of our settings follow existing works [63], detailed in Appendix C. Our pre-training is conducted on two widely utilized datasets: Stack-V2 (Python subset) [58] and Pile [26]. Pile serves as a general corpus aimed at enhancing common sense and memorization skills, while Stack-V2 focuses on code comprehension and reasoning skills. In this section, we explore how PARSCALE performs on a smaller dataset, OpenWebText [28], with repeating data. |
| Dataset Splits | Yes | To fit a parallel scaling law in practice, we pre-train Transformer language models with the Qwen-2.5 dense architecture and tokenizer [70] from scratch on the open-source corpus. We primarily focus on the relationship between parallel scaling and parameter scaling. Therefore, we fix the training data size at 42 billion tokens without data repeat. In the first phase, we do not employ the PARSCALE technique. We refer to the recipe proposed by Allal et al. [2] to construct our training data, which consists of 370B general data, 80B mathematics data, and 50B code data. We train the model for two epochs to consume 1T tokens. Among the general text, there are 345B from FineWeb-Edu [67] and 28B from Cosmopedia 2 [4]; the mathematics data includes 80B from FineMath [2]; and the code data comprises 47B from Stack-V2-Python and 4B from Stack-Python-Edu. In the second phase, ... we increase the proportion of mathematics and code data, finally including a total of 7B general text data, 7B mathematics data, and 7B Stack-Python-Edu data. For HumanEval(+) [12] and MBPP(+) [3], we use the EvalPlus framework [53] for evaluation, where Pass@1 employs greedy decoding and Pass@10 employs a temperature of 0.8. For general tasks, we employ lm-eval harness [6] and report normalized accuracy when provided. The number of few-shot mostly follows existing research configurations. |
| Hardware Specification | No | The paper mentions 'GPU-friendly parallel computation' and analyzes 'GPU Memory (GB)' and 'Latency (s)' in Figure 4, but it does not specify any particular GPU model (e.g., NVIDIA A100, Tesla V100) or other hardware components used for the experiments. |
| Software Dependencies | No | Our training is based on Megatron-LM [79]. ... We utilize bfloat16 precision and the Adam optimizer [42], setting the epsilon to 1e-8, β1 to 0.9, and β2 to 0.95. ... We utilize the LBFGS algorithm [52] via SciPy [88] to locate local minima of the objective. The paper mentions tools like Megatron-LM, Adam optimizer, and SciPy but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use a batch size of 1024 and a sequence length of 2048, resulting in 20K training steps. For other hyperparameters, the learning rate undergoes a linear warm-up over 2,000 steps, reaching a peak of 3 × 10⁻⁴ before decreasing to a minimum of 1 × 10⁻⁵ according to a cosine decay schedule. The models are trained using a batch size of 1,024 and sequence length of 2,048, alongside a RoPE base of 10,000 [82]. We utilize bfloat16 precision and the Adam optimizer [42], setting the epsilon to 1e-8, β1 to 0.9, and β2 to 0.95. All parameters, including backbones and additional ones we've introduced, are initialized with a Gaussian distribution having a standard deviation of 0.02. Furthermore, we maintain a dropout rate of 0, enforce a weight decay rate of 0.1, and clip gradients at 1.0. |
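
The Experiment Setup row quotes a batch size, sequence length, step count, and learning-rate schedule that can be cross-checked with a little arithmetic. The sketch below is not the authors' code. It assumes that the step count is the 42-billion-token budget divided by the tokens consumed per step, that warm-up starts from zero, and that the cosine decay runs from the end of warm-up to the final step. The names `learning_rate`, `TOKENS_PER_STEP`, and `TOTAL_STEPS` are hypothetical.

```python
import math

# Values quoted in the table rows above.
BATCH_SIZE = 1024       # sequences per step
SEQ_LEN = 2048          # tokens per sequence
WARMUP_STEPS = 2_000    # linear warm-up length
PEAK_LR = 3e-4          # peak learning rate
MIN_LR = 1e-5           # learning rate at the end of cosine decay
TOTAL_TOKENS = 42e9     # fixed training budget of 42 billion tokens

# Assumption: one optimizer step consumes BATCH_SIZE * SEQ_LEN tokens,
# and the step count is the token budget divided by that amount.
TOKENS_PER_STEP = BATCH_SIZE * SEQ_LEN
TOTAL_STEPS = round(TOTAL_TOKENS / TOKENS_PER_STEP)


def learning_rate(step: int) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay to MIN_LR.

    Assumptions: warm-up starts at 0, and the decay spans the steps
    from WARMUP_STEPS to TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    print(f"tokens per step: {TOKENS_PER_STEP}")
    print(f"total steps:     {TOTAL_STEPS}")
    for s in (0, 1_000, WARMUP_STEPS, 10_000, TOTAL_STEPS):
        print(f"step {s:>6}: lr = {learning_rate(s):.3e}")
```

Under these assumptions the step count comes out close to the "20K training steps" quoted in the row, which suggests the quoted batch size, sequence length, and token budget are mutually consistent.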