Super Consistency of Neural Network Landscapes and Learning Rate Transfer
Authors: Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, Antonio Orvieto
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText. |
| Researcher Affiliation | Academia | Lorenzo Noci (1), Alexandru Meterez (3,4,5), Thomas Hofmann (1), Antonio Orvieto (2,3,4); (1) ETH Zürich, (2) ELLIS Tübingen, (3) MPI for Intelligent Systems, (4) Tübingen AI Center, (5) Harvard University |
| Pseudocode | No | The paper does not contain clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code upon acceptance. |
| Open Datasets | Yes | We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText. (...) we train a residual network on CIFAR-10 (a 10-class image classification task) using cross-entropy loss |
| Dataset Splits | No | The paper mentions training on various datasets and uses terms like 'batch size' and 'epochs', implying data splits. However, it does not explicitly state the specific train/validation/test percentages or sample counts for the standard datasets used (e.g., CIFAR-10, ImageNet), nor does it provide external links to these splits. While a specific stratified subset of CIFAR-10 is mentioned for one experiment, its full train/val/test split details are not provided. |
| Hardware Specification | Yes | The experiments were run on A100 and H100 GPUs with 80 GB of VRAM. |
| Software Dependencies | No | "Our implementation is based on the implementation provided by Yang et al. [6], with the addition of the residual scaling. This uses a different parametrization from the one reported in Table 1 but equivalent dynamics, obtainable using their abc-rule." "The implementations of our models are done in PyTorch." No specific version numbers for PyTorch or other libraries are provided. |
| Experiment Setup | Yes | Figure 1: Other parameters: B = 128, epochs = 50. (...) Model: 3-layer ConvNet, τ = 0, η0 = 0.7 (optimal). Details in Sec. J. (...) Parameters: batch size = 128, epochs = 20 for the µP/NTP models and 10 for the random feature model, dataset: CIFAR-10, without data augmentation. (Figure 26 caption) (...) HPs: 2 layers, 2 heads, 20 epochs, batch size 512, 100 warmup steps, sequence length 35. (Figure 16 caption) An illustrative reconstruction of the CIFAR-10 setup is sketched below the table. |
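
Because the authors' code is not released, the quoted hyperparameters can only be illustrated, not reproduced exactly. The sketch below shows how the quoted CIFAR-10 ConvNet setup (batch size 128, no data augmentation, cross-entropy loss, η0 = 0.7, 20 epochs) might be wired up in PyTorch using Greg Yang's public `mup` package. The layer widths, the choice of SGD, and the use of `mup` itself are assumptions: the paper only states that its implementation builds on Yang et al. [6] with an added residual scaling.

```python
# Hedged reconstruction of the quoted CIFAR-10 setup; not the authors' code.
# Assumptions: layer widths, SGD as the optimizer, and the public `mup`
# package (pip install mup) as the muP implementation.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from mup import MuReadout, MuSGD, set_base_shapes


def make_convnet(width: int) -> nn.Module:
    """Hypothetical 3-layer ConvNet; only the depth is quoted in the paper."""
    return nn.Sequential(
        nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        MuReadout(width, 10),  # muP readout layer instead of a plain nn.Linear
    )


# set_base_shapes records width multipliers relative to a narrow base model;
# this is what lets a tuned learning rate transfer to wider models under muP.
model = make_convnet(width=512)
set_base_shapes(model, make_convnet(width=32), delta=make_convnet(width=64))

# Quoted settings: batch size 128, CIFAR-10 without data augmentation,
# cross-entropy loss, eta_0 = 0.7, 20 epochs for the muP model.
train_set = torchvision.datasets.CIFAR10(
    "./data", train=True, download=True, transform=T.ToTensor()
)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

optimizer = MuSGD(model.parameters(), lr=0.7)  # muP-scaled SGD (assumption)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```

Under µP, the learning rate tuned on the narrow base model is expected to remain near-optimal as the width grows, which is the learning-rate-transfer behavior the paper links to the super consistency of the loss landscape.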