Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Authors: Blake Bordelon, Lorenzo Noci, Mufan (Bill) Li, Boris Hanin, Cengiz Pehlevan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet.
Researcher Affiliation | Academia | Blake Bordelon, Lorenzo Noci, Mufan (Bill) Li, Boris Hanin & Cengiz Pehlevan (Harvard University, ETH Zürich, Princeton University)
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper mentions using "the µP implementation of the µ-Readout layer, and optimizers (SGD and Adam under µP parametrization) as in the released µP package" but does not explicitly state that the authors' own code for the methodology is open-source, nor does it provide a link (see the µP usage sketch below the table).
Open Datasets | Yes | We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet.
Dataset Splits | No | The paper uses standard datasets like CIFAR-10 and ImageNet but does not explicitly provide specific training/validation/test splits (e.g., percentages or counts) within its text.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions PyTorch, the SGD and Adam optimizers, and "the released µP package", but does not specify version numbers for these software components.
Experiment Setup | Yes | Loss is plotted after 20 epochs on CIFAR-10. All the missing datapoints indicate that the corresponding run diverged. ... ViTs trained with Adam also exhibit learning rate transfer ... We fix the batch size to 64. Unless stated otherwise, all the hyperparameters that are not tuned are set to a default value: β, γ0, σℓ = 1. ... We train the residual convolutional model (Sec. A.1) at relatively large scale (up to almost a billion parameters), for 20 epochs on CIFAR-10 with a fixed learning rate of 0.046 ... Training is performed with SGD with momentum 0.9 and batch size 128. (See the training-setup sketch below the table.)
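
As context for the Open Source Code and Software Dependencies rows: the paper says it relies on the released µP package for its readout layer and optimizers. The sketch below shows one common way to wire that package (`mup` on PyPI) into a small PyTorch model; the architecture, widths, and hyperparameter values are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumption): wiring the released µP package ("mup") into a
# small PyTorch model, mirroring the paper's reported use of the µP readout
# layer and optimizers. The model and numbers below are illustrative only.
import torch
import torch.nn as nn
from mup import MuReadout, MuSGD, set_base_shapes


class TinyMLP(nn.Module):
    def __init__(self, width: int, n_classes: int = 10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(3 * 32 * 32, width),
            nn.ReLU(),
            nn.Linear(width, width),
            nn.ReLU(),
        )
        # MuReadout replaces the final nn.Linear so the output layer follows µP scaling.
        self.readout = MuReadout(width, n_classes)

    def forward(self, x):
        return self.readout(self.body(x.flatten(1)))


# µP defines width scaling relative to a narrow "base" model (and a "delta"
# model of a different width), registered via set_base_shapes.
base, delta, model = TinyMLP(64), TinyMLP(128), TinyMLP(1024)
set_base_shapes(model, base, delta=delta)

# MuSGD (or MuAdam) applies the per-layer learning-rate corrections prescribed by µP.
optimizer = MuSGD(model.parameters(), lr=0.046, momentum=0.9)
```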
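
The Experiment Setup row quotes the training configuration (20 epochs on CIFAR-10, SGD with momentum 0.9, learning rate 0.046, batch size 128, defaults β = γ0 = σℓ = 1). The sketch below is a hedged rendering of a residual convolutional model whose branches are scaled by β/√depth, one standard way to realize the depthwise scaling the paper's parameterization is built on, trained for a single step with the quoted settings; the block design, module names, and random data are placeholders, not the authors' implementation.

```python
# Hedged sketch (assumption): a residual CNN with beta / sqrt(depth) branch
# scaling, trained with the quoted hyperparameters (SGD, momentum 0.9,
# lr 0.046, batch size 128). Block structure and names are illustrative only.
import math
import torch
import torch.nn as nn


class ScaledResBlock(nn.Module):
    def __init__(self, channels: int, depth: int, beta: float = 1.0):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Residual branch multiplied by beta / sqrt(depth); beta defaults to 1
        # as in the paper's quoted default hyperparameters.
        self.branch_scale = beta / math.sqrt(depth)

    def forward(self, x):
        return x + self.branch_scale * self.branch(x)


class ScaledResNet(nn.Module):
    def __init__(self, channels: int = 64, depth: int = 16, n_classes: int = 10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ScaledResBlock(channels, depth) for _ in range(depth)])
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):
        h = self.blocks(self.stem(x))
        return self.head(h.mean(dim=(2, 3)))  # global average pool, then readout


model = ScaledResNet()
# Quoted training settings: SGD with momentum 0.9 and learning rate 0.046.
optimizer = torch.optim.SGD(model.parameters(), lr=0.046, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on a random CIFAR-10-shaped batch (batch size 128, 3x32x32 inputs).
x, y = torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```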