Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Authors: Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. [...] Empirically, we validate this power-law relationship on both synthetic biographies and a set of real-world knowledge extracted from Wikipedia (Section 5).
Researcher Affiliation | Academia | Xinran Gu (1,3), Kaifeng Lyu (1), Jiazheng Li (2,3), Jingzhao Zhang (1,3). (1) Institute for Interdisciplinary Information Sciences, Tsinghua University; (2) College of AI, Tsinghua University; (3) Shanghai Qizhi Institute.
Pseudocode | Yes | Algorithm 1: Estimate Threshold Popularity
Input:
    x: A list of popularity values for each data point, where x_i represents the popularity of the i-th data point.
    y: A list of binary values indicating the correctness of the model's response, where y_i = 1 if the model answers the i-th question correctly, and y_i = 0 otherwise.
    α_target: The target accuracy.
    N_fail: The maximum number of failures before termination, denoting the fault tolerance level.
Output:
    P_thres: The threshold popularity.
Initialize correct count: sum_correct ← 0
Initialize error count: e ← 0
Sort (x, y) by x in ascending order and store the indices in a list I.
Initialize loop variable j ← len(x) − 1
Initialize flag counter flag ← 0
while j ≥ 0 do
    k ← j
    while k ≥ 0 and x_{I_k} = x_{I_j} do
        k ← k − 1
    end while
    for l = k + 1 to j do
        i ← I_l
        sum_correct ← sum_correct + y_i
    end for
    if sum_correct / (len(x) − k − 1) < set threshold then
        e ← e + 1
    end if
    if e = N_fail then
        Return: x_{I_j} {Return the threshold popularity}
    end if
    j ← k
end while
Return: x_{I_0} {If no such point is found, return the smallest popularity value}
Open Source Code | No | We are currently preparing our codebase for public release and will make it available once the cleaning and documentation process is complete.
Open Datasets | Yes | More specifically, we study factual knowledge acquisition. We follow the approach of Allen-Zhu and Li [2024a] to curate a synthetic biography dataset... We then mix this synthetic biography dataset with large-scale web corpus FineWeb-Edu [Penedo et al., 2024] or the Pile [Gao et al., 2020] to create the pre-training mixture. [...] The FineWeb-Edu and OpenWebMath datasets are under the ODC-BY License. The Pile dataset is under the MIT License.
Dataset Splits | Yes | We compute the validation loss on about 50M tokens on a holdout set from the Pile or FineWeb-Edu.
Hardware Specification | Yes | We train models of sizes 70M and 160M using 8 NVIDIA RTX 6000 Ada GPUs, while models of sizes 410M and 1B are trained using either 16 NVIDIA RTX 6000 Ada GPUs or 8 NVIDIA A100 GPUs.
Software Dependencies | No | Our experiments use the GPT-NeoX library [Andonian et al., 2023]. For all experiments, we set the batch size as 512 and the sequence length as 2048. [...] We employ the lm-eval-harness [Gao et al., 2024] codebase to evaluate the zero-shot performance on five downstream tasks.
Experiment Setup | Yes | For all experiments, we set the batch size as 512 and the sequence length as 2048. For all the experiments in Section 3, we use a Warmup-Stable-Decay (WSD) learning rate schedule with a peak learning rate of 10^{-3}. We allocate 160 steps for warmup and the final 10% steps for cooldown. We keep other hyperparameters consistent with those used in Pythia.
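
The Warmup-Stable-Decay schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the quoted text gives the peak learning rate (10^-3), the 160 warmup steps, and the final 10% of steps for cooldown, but not the shape of the warmup or cooldown. Linear ramps are assumed here, and the function name, the `final_lr` parameter, and its default of 0 are hypothetical.

```python
def wsd_learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-3,       # peak learning rate quoted in the setup
    warmup_steps: int = 160,     # warmup length quoted in the setup
    cooldown_frac: float = 0.1,  # final 10% of steps used for cooldown
    final_lr: float = 0.0,       # ASSUMPTION: cooldown target, not given in the source
) -> float:
    """Return the learning rate at a 0-indexed training step."""
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")

    cooldown_steps = int(round(total_steps * cooldown_frac))
    cooldown_start = total_steps - cooldown_steps

    # Warmup phase. ASSUMPTION: linear ramp that reaches peak_lr at the last warmup step.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Stable phase: hold the peak learning rate.
    if step < cooldown_start:
        return peak_lr

    # Cooldown phase. ASSUMPTION: linear interpolation from peak_lr toward final_lr.
    progress = (step - cooldown_start) / cooldown_steps
    return peak_lr + (final_lr - peak_lr) * progress
```

For a hypothetical run of 10,000 steps, the rate ramps up over steps 0 to 159, stays at the peak until step 8,999, and then decays linearly over the last 1,000 steps. Other cooldown shapes (for example exponential or cosine) would fit the quoted description equally well.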
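
Algorithm 1 from the Pseudocode row can also be written as a short Python function. The sketch below follows the quoted steps and rests on a few labeled assumptions: the threshold in the accuracy check ("set threshold" in the quoted text) is taken to be the target accuracy α_target, the quantity compared against it is read as the running accuracy sum_correct / (len(x) − k − 1), and both loops are read as running while the index is ≥ 0. The `flag` counter, which the quoted pseudocode initializes but never uses, is omitted. Function and argument names are illustrative.

```python
from typing import Sequence


def estimate_threshold_popularity(
    x: Sequence[float],
    y: Sequence[int],
    alpha_target: float,
    n_fail: int,
) -> float:
    """Scan from the most popular items downward and return the popularity
    at which the running accuracy has dropped below alpha_target n_fail times."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    if n_fail < 1:
        raise ValueError("n_fail must be at least 1")

    # Indices of the data points, sorted by popularity in ascending order.
    order = sorted(range(len(x)), key=lambda i: x[i])

    sum_correct = 0  # correct answers among the items scanned so far
    errors = 0       # number of times the running accuracy fell below target
    j = len(x) - 1

    while j >= 0:
        # Move k left past every item that shares the popularity of item j.
        k = j
        while k >= 0 and x[order[k]] == x[order[j]]:
            k -= 1

        # Add the whole group of tied items to the running count.
        for l in range(k + 1, j + 1):
            sum_correct += y[order[l]]

        # ASSUMPTION: the check compares the running accuracy with alpha_target.
        scanned = len(x) - k - 1
        if sum_correct / scanned < alpha_target:
            errors += 1

        if errors == n_fail:
            return x[order[j]]

        j = k

    # No threshold found: fall back to the smallest popularity value.
    return x[order[0]]
```

With x = [1, 2, 2, 5, 10], y = [0, 0, 1, 1, 1], a target accuracy of 0.8, and N_fail = 1, the function returns 2: the running accuracy is 1.0 after the two most popular items and first drops to 0.75 once the group with popularity 2 is included.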