Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Don't be lazy: CompleteP enables compute-efficient deep transformers

Authors: Nolan Dey, Bin Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All experiments were run on Cerebras CS-3 systems. We compare transfer of learning rate and weight initialization standard deviation across depth (2-128 layers)... In Table 2 we evaluate the 20 TPP P_non-emb = 1.5B models at the minimum and optimal N:L settings, and confirm the upstream gains also translate to gains across five downstream tasks [95–100].
Researcher Affiliation | Collaboration | Nolan Dey (Cerebras Systems); Bin Claire Zhang (Cerebras Systems); Lorenzo Noci (ETH Zurich, Princeton University); Mufan Li (Princeton University); Blake Bordelon (Harvard University); Shane Bergsma (Cerebras Systems); Cengiz Pehlevan (Harvard University, Kempner Institute); Boris Hanin (Princeton University); Joel Hestness (Cerebras Systems)
Pseudocode | No | The paper describes the methodology and parameterizations in textual form and tables (e.g., Table 1), and includes mathematical derivations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | A minimal implementation is available at https://github.com/EleutherAI/nanoGPT-mup/tree/completep.
Open Datasets | Yes | We pretrain our models using an autoregressive loss (i.e. the next token prediction objective) on the SlimPajama dataset [85] with a maximal sequence length of 2048 tokens using the GPT-2 tokenizer [79].
Dataset Splits | Yes | We pretrain our models using an autoregressive loss (i.e. the next token prediction objective) on the SlimPajama dataset [85]... SlimPJ Validation Loss
Hardware Specification | Yes | All experiments were run on Cerebras CS-3 systems.
Software Dependencies | No | The paper mentions specific algorithms and components such as the 'AdamW optimizer' and the 'GPT-2 tokenizer' but does not provide version numbers for the underlying software libraries or frameworks (e.g., PyTorch, TensorFlow, or specific transformer libraries).
Experiment Setup | Yes | We train decoder-only Transformer language models [79] with prenormalization, untied embeddings, ALiBi position embeddings [80] and ReLU² nonlinearity [81, 82], using the AdamW optimizer with decoupled weight decay [83] with β1 = 0.9, β2 = 0.95, ϵ = 1e-16. The learning rate η schedule follows a linear warmup of min(10% of steps, 375M tokens), then a linear decay to zero [84]. All models use d_head = 64 for each attention head and feedforward dimension 4N.
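
The learning rate schedule quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' code. It assumes a step-based training loop and shows one plausible reading of "linear warmup of min(10% of steps, 375M tokens), then a linear decay to zero". The function names, the batch size of 256 sequences, the step count, and the peak learning rate are all hypothetical; only the 10% fraction, the 375M-token cap, and the 2048-token sequence length come from the quoted text.

```python
def warmup_steps(total_steps: int, tokens_per_step: int,
                 warmup_percent: int = 10,
                 warmup_token_cap: int = 375_000_000) -> int:
    """Warmup length in steps: min(10% of steps, 375M tokens).

    Converting the token cap to steps by floor division is an assumption.
    """
    by_fraction = total_steps * warmup_percent // 100
    by_tokens = warmup_token_cap // tokens_per_step
    return max(1, min(by_fraction, by_tokens))


def lr_at_step(step: int, total_steps: int, peak_lr: float, n_warmup: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < n_warmup:
        return peak_lr * step / n_warmup
    decay_span = total_steps - n_warmup
    if decay_span <= 0:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / decay_span)


# Hypothetical run: 256 sequences of 2048 tokens per step, 10,000 steps.
TOKENS_PER_STEP = 256 * 2048  # batch size is an assumption, not from the paper
TOTAL_STEPS = 10_000
PEAK_LR = 1e-3                # placeholder value

n_warmup = warmup_steps(TOTAL_STEPS, TOKENS_PER_STEP)
schedule = [lr_at_step(s, TOTAL_STEPS, PEAK_LR, n_warmup)
            for s in range(TOTAL_STEPS + 1)]
```

In this hypothetical run the token cap is the binding limit: 375M tokens correspond to fewer steps than 10% of the run. For short runs the 10% rule takes over instead.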