Super Consistency of Neural Network Landscapes and Learning Rate Transfer

Authors: Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, Antonio Orvieto

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformers-based language models trained on WikiText."
Researcher Affiliation | Academia | Lorenzo Noci (1), Alexandru Meterez (3,4,5), Thomas Hofmann (1), Antonio Orvieto (2,3,4); 1: ETH Zürich, 2: ELLIS Tübingen, 3: MPI for Intelligent Systems, 4: Tübingen AI Center, 5: Harvard University
Pseudocode | No | The paper does not contain clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | "We will release the code upon acceptance."
Open Datasets | Yes | "We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformers-based language models trained on WikiText." (...) "we train a residual network on CIFAR-10 (a 10 classes image classification task) using cross-entropy loss"
Dataset Splits | No | The paper trains on standard datasets and refers to batch sizes and epochs, which implies data splits, but it does not state explicit train/validation/test percentages or sample counts for the datasets used (e.g., CIFAR-10, ImageNet), nor does it link to external split definitions. A stratified CIFAR-10 subset is mentioned for one experiment, but its full train/val/test split is likewise unspecified. (For reference, a sketch of the standard torchvision CIFAR-10 split is given after the table.)
Hardware Specification | Yes | "The experiments were run on A100 and H100 GPUs, with 80GB VRAM."
Software Dependencies | No | "Our implementation is based on the implementation provided by Yang et al. [6], with the addition of the residual scaling. This uses a different parametrization from the one reported in Table 1 but equivalent dynamics, obtainable using their abc-rule." "The implementations of our models are done in PyTorch." No specific version numbers for PyTorch or other libraries are provided. (An illustrative sketch of width-dependent per-layer learning rates is given after the table.)
Experiment Setup | Yes | "Figure 1: Other parameters: B = 128, epochs = 50. (...) Model: 3-layer ConvNet, τ = 0, η0 = 0.7 (optimal). Details in Sec. J." (...) "Parameters: batch size = 128, epochs = 20 for the µP/NTP models and 10 for the random feature model, dataset: CIFAR-10, without data augmentation." (Figure 26 caption) (...) "HPs: 2 layers, 2 heads, 20 epochs, batch size 512, 100 warmup steps, sequence length 35." (Figure 16 caption) (These settings are collected in a config sketch after the table.)
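
For reference on the Dataset Splits row: the standard CIFAR-10 distribution ships with 50,000 training and 10,000 test images. The minimal torchvision sketch below is not from the paper; it simply loads that default split without augmentation, and whether the authors used exactly this split or carved out a validation set is not stated.

    # Standard CIFAR-10 split via torchvision (50,000 train / 10,000 test images).
    # Illustrative only: the paper does not specify its exact split or any
    # validation carve-out.
    import torchvision
    import torchvision.transforms as T

    transform = T.ToTensor()  # no data augmentation, matching the Figure 26 note
    train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                             download=True, transform=transform)
    test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                            download=True, transform=transform)
    print(len(train_set), len(test_set))  # 50000 10000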
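
For context on the Software Dependencies row: the parametrizations compared in the paper (µP, NTP) differ in how initialization and learning rates scale with network width, following Yang et al.'s abc-rule. The sketch below is not the authors' implementation; it only illustrates the standard PyTorch mechanism, optimizer parameter groups, through which width-dependent per-layer learning rates are typically applied. The exponents are placeholders that would have to be filled in from the paper's Table 1 or the abc-rule.

    # Illustrative PyTorch sketch (not the authors' code): width-dependent
    # per-layer learning rates via optimizer parameter groups.
    import torch
    import torch.nn as nn

    width, base_width = 1024, 64   # hidden width being scaled and reference width
    eta0 = 0.7                     # base learning rate tuned at the reference width

    # Placeholder exponents; actual values follow the abc-rule / Table 1 of the
    # paper and depend on the parametrization (muP, NTP, ...) and the optimizer.
    c = {"input": 0.0, "hidden": 0.0, "output": 0.0}

    def layer_lr(kind):
        # eta_layer = eta0 * (width / base_width) ** (-c[kind])
        return eta0 * (width / base_width) ** (-c[kind])

    model = nn.Sequential(
        nn.Linear(3 * 32 * 32, width), nn.ReLU(),   # input layer (CIFAR-10 sized)
        nn.Linear(width, width), nn.ReLU(),         # hidden layer
        nn.Linear(width, 10),                       # output (readout) layer
    )

    optimizer = torch.optim.SGD(
        [
            {"params": model[0].parameters(), "lr": layer_lr("input")},
            {"params": model[2].parameters(), "lr": layer_lr("hidden")},
            {"params": model[4].parameters(), "lr": layer_lr("output")},
        ],
        lr=eta0,          # default, overridden by the per-group values above
        momentum=0.9,
    )

Learning-rate transfer, the phenomenon the paper studies under µP, means that the base learning rate tuned at the small reference width remains (near-)optimal as the width grows.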
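
Finally, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The dictionary below is a hypothetical summary: its keys and grouping are illustrative rather than taken from the paper's code, and it records only the values quoted from the figure captions above.

    # Hypothetical summary of the quoted hyperparameters; key names are
    # illustrative, values are transcribed from the figure captions cited above.
    quoted_setups = {
        "Figure 1": {
            "model": "3-layer ConvNet",
            "batch_size": 128,           # "B = 128"
            "epochs": 50,
            "base_lr": 0.7,              # eta0 = 0.7, reported as optimal
            "tau": 0.0,                  # residual scaling parameter
        },
        "Figure 26": {
            "dataset": "CIFAR-10 (no data augmentation)",
            "batch_size": 128,
            "epochs": {"muP/NTP models": 20, "random feature model": 10},
        },
        "Figure 16": {
            "layers": 2,
            "heads": 2,
            "epochs": 20,
            "batch_size": 512,
            "warmup_steps": 100,
            "sequence_length": 35,
        },
    }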