Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A thorough reproduction and evaluation of $\mu$P

Authors: Georgios Vlassis, David Belius, Volodymyr Fomichov

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper is an independent empirical reproduction of the claimed benefits of the µP parametrization proposed in Yang & Hu (2020) and Yang et al. (2021). ... We address this by independently reproducing the empirical claims of the original works. At the same time, we substantially increase the scale of the experiments, by training 16000 neural networks of sizes from 500 to 1B parameters, and empirically investigate µP's effect on outputs, gradient updates, weights, training loss and validation loss.
Researcher Affiliation | Academia | Georgios Vlassis (EMAIL), D-ITET, ETH Zurich; Volodymyr Fomichov (EMAIL), Faculty of Mathematics and Computer Science, UniDistance; David Belius (EMAIL), Faculty of Mathematics and Computer Science, UniDistance
Pseudocode | No | The paper describes the methodology and experimental setup in detail but does not present any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our work is fundamentally an independent reproduction of Yang & Hu (2020) and Yang et al. (2021). Hence, we made every effort that our results are reproducible themselves. The complete repository of the training code is already available online: https://github.com/gvlassis/ant
Open Datasets | Yes | We experimented with four architectures, across five datasets. Specifically, we tested a 3-layer MLP on the California Housing and the MNIST datasets, a VGG11 CNN and a ViT on CIFAR-10, and a transformer on Tiny Shakespeare and WikiText-103. ... The California Housing dataset (Pace & Barry, 1997) ... The MNIST dataset (LeCun et al., 1998) ... The CIFAR-10 dataset (Krizhevsky et al., 2009) ... The Tiny Shakespeare dataset (Karpathy, 2015) ... The WikiText-103 dataset (Merity et al., 2016)
Dataset Splits | Yes | For our MLP on California Housing, ... held out 2000 for validation and 2000 for testing. ... The MNIST dataset ... We held out 10000 images for validation and 10000 for testing. ... The CIFAR-10 dataset ... We held out 10000 images for validation and 10000 for testing. ... The Tiny Shakespeare dataset ... We held out 25K tokens for validation and 25K for testing. ... The WikiText-103 dataset ... We held out 4K articles for validation and 4.5K for testing.
Hardware Specification | Yes | In total, we trained 15760 neural networks, spanning from 500 to 1B parameters, which needed 3200 hours when using an NVIDIA A100.
Software Dependencies | No | The paper mentions using PyTorch and the Adam optimizer but does not provide specific version numbers for PyTorch or other libraries.
Experiment Setup | Yes | For all the experiments, we set the initialization scale c to 1/10 and used the Adam optimizer (Kingma & Ba, 2017) with PyTorch's defaults. Additionally, we trained without weight-decay or data augmentation. ... Each training run consisted of 50000 mini-batches of size 16. ... We varied ζ from ζ = 1 ... to ζ = 256 ... The remaining training details are the same as in Section 3.1, with the difference that here we used 20000 mini-batches. ... The stages had base width 4, 8, 16 and 32 respectively. The classifier head had base width 20, and 0.5 dropout probability. ... Each training run consisted of 50000 mini-batches, of size 32. ... We used the ViT architecture (Dosovitskiy et al., 2020) with a patch size of four and six blocks of base width 32, eight heads, expansion factor of one and 0.1 dropout probability. For positional embeddings we used sinusoidal positional encodings. ... We used the transformer architecture (Vaswani et al., 2017) with a context of 128 tokens and six blocks of base width 32, eight heads, expansion factor of four and no dropout. ... Each training run consisted of 20000 mini-batches of size 32. ... The architecture used was a scaled-up version of Section 3.5, with the context doubled to 256 tokens, twelve transformer blocks with base width 144 and twelve attention heads. ... Each training run consisted of 8000 mini-batches, of size 512 (1B tokens). ... we used β2 = 0.95 for Adam.
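The quoted experiment setup can be summarized in a short PyTorch sketch. This is not the authors' code (their repository is linked in the Open Source Code row); the stand-in model, layer sizes, and the way the initialization scale is applied here are illustrative assumptions, while the base width, the width multiplier ζ, the absence of weight decay, and the β2 = 0.95 used for the largest runs follow the quote.

```python
# Minimal sketch of the quoted training configuration (assumptions noted in comments).
import torch
from torch import nn

base_width = 32          # quoted base width of the transformer/ViT blocks
width_multiplier = 8     # ζ, varied from 1 to 256 in the paper
width = base_width * width_multiplier

# Hypothetical stand-in model; the paper actually trains an MLP, VGG11, ViT and transformers.
model = nn.Sequential(nn.Linear(128, width), nn.ReLU(), nn.Linear(width, 1))

# Initialization scale c = 1/10, applied here as a plain rescaling of the initial weights;
# the exact µP initialization and learning-rate rules are per-layer and defined in the paper.
with torch.no_grad():
    for p in model.parameters():
        p.mul_(0.1)

# Adam with PyTorch defaults, no weight decay or data augmentation; β2 = 0.95 was used
# only for the scaled-up 1B-token runs.
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.95), weight_decay=0.0)
```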