Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Authors: Blake Bordelon, Lorenzo Noci, Mufan (Bill) Li, Boris Hanin, Cengiz Pehlevan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet.
Researcher Affiliation | Academia | Blake Bordelon, Lorenzo Noci, Mufan (Bill) Li, Boris Hanin & Cengiz Pehlevan (Harvard University, ETH Zürich, Princeton University)
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper mentions using "the µP implementation of the µ-Readout layer, and optimizers (SGD and Adam under µP parametrization) as in the released µP package" but does not explicitly state that the authors' own code for the methodology is open-source, nor does it provide a link (see the µP usage sketch below the table).
Open Datasets | Yes | We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet.
Dataset Splits | No | The paper uses standard datasets like CIFAR-10 and ImageNet but does not explicitly provide specific training/validation/test splits (e.g., percentages or counts) within its text.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions PyTorch, the SGD and Adam optimizers, and "the released µP package", but does not specify version numbers for these software components.
Experiment Setup | Yes | Loss is plotted after 20 epochs on CIFAR-10. All the missing datapoints indicate that the corresponding run diverged. ... ViTs trained with Adam also exhibit learning rate transfer ... We fix the batch size to 64. Unless stated otherwise, all the hyperparameters that are not tuned are set to a default value: β, γ0, σℓ = 1. ... We train the residual convolutional model (Sec. A.1) at relatively large scale (up to almost a billion parameters), for 20 epochs on CIFAR-10 with a fixed learning rate of 0.046 ... Training is performed with SGD with momentum 0.9 and batch size 128. (See the training-setup sketch below the table.)
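
As context for the Open Source Code and Software Dependencies rows: the paper says it relies on the released µP package for its readout layer and optimizers. The sketch below shows one common way to wire that package (`mup` on PyPI) into a small PyTorch model; the architecture, widths, and hyperparameter values are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumption): wiring the released µP package ("mup") into a
# small PyTorch model, mirroring the paper's reported use of the µP readout
# layer and optimizers. The model and numbers below are illustrative only.
import torch
import torch.nn as nn
from mup import MuReadout, MuSGD, set_base_shapes


class TinyMLP(nn.Module):
    def __init__(self, width: int, n_classes: int = 10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(3 * 32 * 32, width),
            nn.ReLU(),
            nn.Linear(width, width),
            nn.ReLU(),
        )
        # MuReadout replaces the final nn.Linear so the output layer follows µP scaling.
        self.readout = MuReadout(width, n_classes)

    def forward(self, x):
        return self.readout(self.body(x.flatten(1)))


# µP defines width scaling relative to a narrow "base" model (and a "delta"
# model of a different width), registered via set_base_shapes.
base, delta, model = TinyMLP(64), TinyMLP(128), TinyMLP(1024)
set_base_shapes(model, base, delta=delta)

# MuSGD (or MuAdam) applies the per-layer learning-rate corrections prescribed by µP.
optimizer = MuSGD(model.parameters(), lr=0.046, momentum=0.9)
```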
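
The Experiment Setup row quotes the training configuration (20 epochs on CIFAR-10, SGD with momentum 0.9, learning rate 0.046, batch size 128, defaults β = γ0 = σℓ = 1). The sketch below is a hedged rendering of a residual convolutional model whose branches are scaled by β/√depth, one standard way to realize the depthwise scaling the paper's parameterization is built on, trained for a single step with the quoted settings; the block design, module names, and random data are placeholders, not the authors' implementation.

```python
# Hedged sketch (assumption): a residual CNN with beta / sqrt(depth) branch
# scaling, trained with the quoted hyperparameters (SGD, momentum 0.9,
# lr 0.046, batch size 128). Block structure and names are illustrative only.
import math
import torch
import torch.nn as nn


class ScaledResBlock(nn.Module):
    def __init__(self, channels: int, depth: int, beta: float = 1.0):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Residual branch multiplied by beta / sqrt(depth); beta defaults to 1
        # as in the paper's quoted default hyperparameters.
        self.branch_scale = beta / math.sqrt(depth)

    def forward(self, x):
        return x + self.branch_scale * self.branch(x)


class ScaledResNet(nn.Module):
    def __init__(self, channels: int = 64, depth: int = 16, n_classes: int = 10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ScaledResBlock(channels, depth) for _ in range(depth)])
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):
        h = self.blocks(self.stem(x))
        return self.head(h.mean(dim=(2, 3)))  # global average pool, then readout


model = ScaledResNet()
# Quoted training settings: SGD with momentum 0.9 and learning rate 0.046.
optimizer = torch.optim.SGD(model.parameters(), lr=0.046, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on a random CIFAR-10-shaped batch (batch size 128, 3x32x32 inputs).
x, y = torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```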