Neural networks with late-phase weights

Authors: Johannes von Oswald, Seijin Kobayashi, João Sacramento, Alexander Meulemans, Christian Henning, Benjamin F. Grewe

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning.
Researcher Affiliation | Academia | Institute of Neuroinformatics, University of Zürich and ETH Zürich, Zürich, Switzerland; {voswaldj,seijink,rjoao,ameulema,henningc,bgrewe}@ethz.ch
Pseudocode | Yes (see the code sketch after this table) | Algorithm 1 (Late-phase learning). Require: base weights θ, late-phase weight set Φ, dataset D, gradient scale factor γ_θ, loss L. Require: training iteration t > T0. For k = 1, ..., K: M_k ← sample minibatch from D; Δθ_k ← ∇_θ L(M_k, θ, φ_k); φ_k ← U_φ(φ_k, ∇_{φ_k} L(M_k, θ, φ_k)). Then θ ← U_θ(θ, γ_θ Σ_{k=1}^{K} Δθ_k).
Open Source Code | Yes | We provide code to reproduce our experiments at https://github.com/seijin-kobayashi/late-phase-weights
Open Datasets | Yes (see the loading sketch after this table) | To test the applicability of our method to more realistic problems, we next augment standard neural network models with late-phase weights and examine their performance on the CIFAR-10 and CIFAR-100 image classification benchmarks (Krizhevsky, 2009). ... train deep residual networks (He et al., 2016) and a densely-connected convolutional network (DenseNet; Huang et al., 2018) on the ImageNet dataset (Russakovsky et al., 2015). ... experiments on the language modeling benchmark enwik8.
Dataset Splits | No | The paper mentions 'Validation set acc. (%) on ImageNet' but does not provide specific details on how the training, validation, and test sets were split (e.g., exact percentages, sample counts, or explicit references to predefined validation splits for all datasets).
Hardware Specification | Yes | We used a single NVIDIA GeForce 2080 Ti GPU for the experiment.
Software Dependencies | Yes | The result was computed in Python 3.7, using the automatic differentiation and GPU acceleration package PyTorch (version 1.4.0).
Experiment Setup | Yes (see the configuration sketch after this table) | Throughout our CIFAR-10/100 experiments we set K = 10, use a fast base gradient scale factor of γ_θ = 1, and set our late-phase initialization hyperparameters to T0 = 120 (measured henceforth in epochs; T0 = 100 for SWA) and do not use initialization noise, σ0 = 0.
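
The Algorithm 1 pseudocode quoted in the table maps onto a single training step applied once t > T0. Below is a minimal PyTorch sketch of that step, assuming separate optimizers for the base weights θ and for each late-phase member φ_k, and a hypothetical `late_index` argument that selects which member the model uses in the forward pass; all names are illustrative and are not taken from the authors' released code.

import torch

def late_phase_step(model, loss_fn, data_iter, base_opt, late_opts, gamma_theta=1.0):
    """One late-phase update (sketch of Algorithm 1), used for iterations t > T0."""
    K = len(late_opts)
    base_opt.zero_grad()                             # base gradients accumulate over the K members
    for k in range(K):
        x, y = next(data_iter)                       # M_k: sample a minibatch from D
        late_opts[k].zero_grad()
        loss = loss_fn(model(x, late_index=k), y)    # hypothetical argument selecting phi_k
        loss.backward()                              # grads for theta (accumulated) and phi_k
        late_opts[k].step()                          # phi_k <- U_phi(phi_k, grad_{phi_k} L)
    with torch.no_grad():                            # scale the summed base gradient by gamma_theta
        for group in base_opt.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.grad.mul_(gamma_theta)
    base_opt.step()                                  # theta <- U_theta(theta, gamma_theta * sum_k Delta theta_k)

In this reading, base_opt and each late_opts[k] are built over disjoint parameter groups, the φ_k updates are applied member by member, and θ is only updated after the K per-member gradients have been accumulated, as in the quoted pseudocode.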
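The datasets named in the Open Datasets row are all publicly available. As an illustration only, the two CIFAR benchmarks can be fetched through torchvision (ImageNet requires a manual download, and enwik8 is distributed separately); the transform here is a placeholder, not the paper's augmentation pipeline.

from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # placeholder transform for illustration
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)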
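The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration. The dictionary below is a hypothetical illustration of those values, not the authors' configuration format.

cifar_late_phase_config = {
    "K": 10,             # number of late-phase weight components
    "gamma_theta": 1.0,  # base gradient scale factor
    "T0_epochs": 120,    # epoch at which late-phase weights are introduced (100 when combined with SWA)
    "sigma_0": 0.0,      # standard deviation of initialization noise (none used)
}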