Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Predictable Scale (Part II) --- Farseer: A Refined Scaling Law in LLMs

Authors: Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours.
Researcher Affiliation | Collaboration | Houyi Li (Fudan University & StepFun, China); Wenzhen Zheng (StepFun, China); Qiufeng Wang (StepFun, China); Zhenyu Ding (Xi'an Jiaotong University, China); Haoying Wang (Xi'an Jiaotong University, China); Zili Wang (StepFun, China); Shijie Xuyang (Fudan University & StepFun, China); Ning Ding (Xi'an Jiaotong University, China); Shuigeng Zhou (Fudan University, China); Xiangyu Zhang (StepFun & Megvii Technology, China); Daxin Jiang (StepFun, China)
Pseudocode | Yes | Algorithm 1: Optimal Transformation Selection for $Y = f(X)$. Input: discrete data points $(X_k, Y_k)$ for $k = 1, \dots, M$. Input: dictionary of candidate transformation functions $G = \{g^{(1)}, g^{(2)}, \dots\}$ // each $g \in G$ is a function, e.g., identity, logarithm. Output: optimal transformations $g_Y, g_X \in G$. Output: coefficients $(a, b)$ for the linear model $g_Y(Y_k) \approx a\, g_X(X_k) + b$. Output: minimum residual sum of squares $\ell_{\min}$. ... Algorithm 2: Differential Piecewise Fitting (with Stretched-Exponential Forms). Input: loss data points $L(N, D)$, scale factor $\lambda$. Output: parameters $\theta_A = (a_1, b_1, \alpha)$, $\theta_B = (a_2, b_2, \beta)$, $\theta_{LN} = (a_3, b_3, \gamma)$. Output: final fit $L(N, D) \approx \exp(a_2 N^{\beta} + b_2)\, D^{-\exp(a_1 N^{\alpha} + b_1)} + \exp(a_3 N^{\gamma} + b_3)$.
Open Source Code | Yes | To foster further research, we are comprehensively open-sourcing all code, data, results³, all training logs⁴, all models used in scaling law fitting⁵. ³ https://github.com/Farseer-Scaling-Law/Farseer
Open Datasets | Yes | To foster further research, we are comprehensively open-sourcing all code, data, results³, all training logs⁴, all models used in scaling law fitting⁵. ³ https://github.com/Farseer-Scaling-Law/Farseer
Dataset Splits | Yes | For evaluation, we utilize a high-quality, specially constructed validation set containing 30 million tokens. This dataset is entirely separate from our training data, ensuring that all validation samples are unseen.
Hardware Specification | Yes | ... consuming roughly 3 million NVIDIA H100 GPU hours.
Software Dependencies | No | The paper mentions using the AdamW optimizer and a Byte Pair Encoding (BPE) tokenizer, but does not provide specific version numbers for these or any other software libraries or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | Our model architecture design follows the Llama [39, 16] design, using the AdamW optimizer [29] with β values of [0.9, 0.95], an epsilon of $10^{-8}$, a weight decay of 0.1, and a gradient clipping norm of 1.0. We set the parameters N and D for these models using a geometric progression with a common ratio of 2, and specific details can be found in Appendix F. A visualization of the experimental (N, D) grid is provided in Fig. 7 (blue circles). Our learning rate schedule includes a linear warmup for the first 2,000 steps, followed by cosine decay to $1 \times 10^{-5}$ for the remainder of training. The model uses a fixed sequence length of 2,048 tokens.
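
The learning rate schedule quoted in the Experiment Setup row (linear warmup for 2,000 steps, then cosine decay to 1e-5) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the peak learning rate, the total step count, and the exact shape of the warmup ramp are hypothetical, since the quoted text specifies only the warmup length and the final value.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_steps=2000, min_lr=1e-5):
    """Return the learning rate at a 0-indexed training step.

    warmup_steps and min_lr follow the quoted setup; peak_lr and
    total_steps are NOT given in the quoted text and must be supplied.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from peak_lr / warmup_steps up to peak_lr.
        # (Starting the ramp just above zero is an assumption.)
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine decay from peak_lr (progress = 0) to min_lr (progress = 1).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical values, chosen only to exercise the function.
schedule = [learning_rate(s, total_steps=10_000, peak_lr=3e-4) for s in range(10_001)]
```

With these hypothetical values the rate rises through step 1,999, sits at the peak at step 2,000, and reaches the 1e-5 floor at the final step.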
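
The selection problem stated as Algorithm 1 in the Pseudocode row (choose transformations of X and Y under which the data are best described by a straight line) can be illustrated with a small sketch. The quoted pseudocode elides its body, so everything procedural here is an assumption: the exhaustive search over transformation pairs, the two-entry candidate dictionary, the comparison of raw residual sums of squares without any rescaling, and all function names.

```python
import math
from itertools import product

# Hypothetical candidate dictionary G; the quoted text names identity and
# logarithm only as examples.
CANDIDATES = {"identity": lambda v: v, "log": math.log}


def fit_line(xs, ys):
    """Ordinary least squares for y ~ a * x + b; returns (a, b, rss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def select_transformations(xs, ys, candidates=CANDIDATES):
    """Try every (g_Y, g_X) pair and keep the one with the smallest RSS.

    Assumption: RSS values are compared directly across different g_Y,
    which favors transformations that compress Y. Inputs must be positive
    for the logarithm candidate.
    """
    best = None
    for (name_y, g_y), (name_x, g_x) in product(candidates.items(), repeat=2):
        a, b, rss = fit_line([g_x(x) for x in xs], [g_y(y) for y in ys])
        if best is None or rss < best["rss"]:
            best = {"g_y": name_y, "g_x": name_x, "a": a, "b": b, "rss": rss}
    return best


# Synthetic power-law data y = 3 * x^(-0.5), which is linear in log-log space.
xs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
ys = [3.0 * x ** -0.5 for x in xs]
result = select_transformations(xs, ys)
```

On this synthetic data the search should select the logarithm for both axes, with a slope near -0.5 and an intercept near log 3.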