Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
Authors: Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. |
| Researcher Affiliation | Academia | Blake Bordelon, Lorenzo Noci, Mufan (Bill) Li, Boris Hanin & Cengiz Pehlevan (Harvard University, ETH Zürich, Princeton University) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper mentions using 'the µP implementation of the µ-Readout layer, and optimizers (SGD, and Adam under µP parametrization) as in the released µP package' but does not explicitly state that the authors' own code for the methodology is open-source or provide a link. (A hedged sketch of how that package is typically used follows the table.) |
| Open Datasets | Yes | We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. |
| Dataset Splits | No | The paper uses standard datasets like CIFAR-10 and ImageNet but does not explicitly provide specific training/validation/test splits (e.g., percentages or counts) within its text. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' and optimizers like 'SGD, and Adam' and refers to 'the released µP package' but does not specify version numbers for these software components. |
| Experiment Setup | Yes | Loss is plotted after 20 epochs on CIFAR-10. All the missing datapoints indicate that the corresponding run diverged. ... ViTs trained with Adam also exhibit learning rate transfer ... We fix the batch size to 64. Unless stated otherwise, all the hyperparameters that are not tuned are set to a default value: β, γ0, σℓ = 1. ... We train the residual convolutional model (Sec. A.1) at relatively large scale (up to almost a billion parameters) for 20 epochs on CIFAR-10 with a fixed learning rate of 0.046 ... Training is performed with SGD with momentum 0.9 and batch size 128. (A configuration sketch based on these reported values follows the table.) |
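The 'released µP package' referenced in the Open Source Code row is presumably the `mup` package released alongside the original µP work. Below is a minimal sketch of how its `MuReadout` layer and µP-aware SGD optimizer are typically wired up; the toy MLP, its widths, and the CIFAR-10-sized input are illustrative assumptions, not the authors' (unreleased) code.

```python
# Minimal sketch of wiring a model to the mup package's MuReadout and MuSGD.
# The toy MLP and all widths are illustrative assumptions; only the mup calls
# follow that package's documented interface.
import torch.nn as nn
from mup import MuReadout, MuSGD, set_base_shapes


def make_mlp(width: int, num_classes: int = 10) -> nn.Module:
    """Toy network whose readout layer uses the µP-scaled MuReadout."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(3 * 32 * 32, width),  # CIFAR-10-sized input (assumed)
        nn.ReLU(),
        MuReadout(width, num_classes),  # µP readout: output scale shrinks with width
    )


model = make_mlp(width=1024)  # target width to be trained
base = make_mlp(width=64)     # base widths define how µP rescales parameters
delta = make_mlp(width=128)   # delta model tells mup which dimensions grow
set_base_shapes(model, base, delta=delta)

# MuSGD applies µP's per-layer learning-rate corrections on top of plain SGD.
optimizer = MuSGD(model.parameters(), lr=0.1, momentum=0.9)
```

Under such a setup, a learning rate tuned at small width should transfer to larger widths; the paper's contribution is a parameterization under which this transfer also holds across depth in residual networks.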
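Similarly, the hyperparameters quoted in the Experiment Setup row translate into roughly the following CIFAR-10 training configuration. This is a sketch only: the torchvision ResNet-18 and the data pipeline are placeholders for the paper's residual convolutional model, and the µP/depth-scaling details from the sketch above are omitted; only the quoted values (20 epochs, batch size 128, learning rate 0.046, momentum 0.9) come from the paper.

```python
# Sketch of the reported CIFAR-10 setup: 20 epochs, SGD with momentum 0.9,
# batch size 128, fixed learning rate 0.046. The architecture and data
# pipeline are placeholders, not the paper's model.
import torch
import torchvision
import torchvision.transforms as T

config = {
    "epochs": 20,       # "Loss is plotted after 20 epochs on CIFAR-10"
    "batch_size": 128,  # SGD runs; the ViT/Adam runs reportedly use batch size 64
    "lr": 0.046,        # fixed learning rate quoted for the residual ConvNet runs
    "momentum": 0.9,
}

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
loader = torch.utils.data.DataLoader(
    train_set, batch_size=config["batch_size"], shuffle=True
)

model = torchvision.models.resnet18(num_classes=10)  # placeholder architecture
optimizer = torch.optim.SGD(
    model.parameters(), lr=config["lr"], momentum=config["momentum"]
)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(config["epochs"]):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```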