Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits

Authors: Daniel Morales-Brotons, Thijs Vogels, Hadrien Hendrikx

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) prediction consistency, iii) calibration and iv) transfer learning.
Researcher Affiliation | Academia | Daniel Morales-Brotons EMAIL EPFL; Thijs Vogels, EPFL; Hadrien Hendrikx EMAIL Centre Inria de l'Univ. Grenoble Alpes, CNRS, LJK, Grenoble, France
Pseudocode | No | The paper describes the Exponential Moving Average (EMA) update rule mathematically in Equation (1) and elaborates on various methodologies in prose, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps.
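The EMA update rule referenced here (Equation 1) is, in its standard form, a convex combination of the running average and the current weights, θ_EMA ← τ·θ_EMA + (1 − τ)·θ. A minimal sketch under that assumption; names and the mock training loop are illustrative, not the authors' code:

```python
# Minimal sketch of an EMA of model weights, following the standard rule
# theta_ema <- tau * theta_ema + (1 - tau) * theta. Plain Python lists
# stand in for parameter tensors; all names here are illustrative.

def ema_update(ema_weights, weights, tau):
    """One EMA step per parameter; tau close to 1 averages more slowly."""
    return [tau * e + (1.0 - tau) * w for e, w in zip(ema_weights, weights)]

# Track an EMA alongside a mock training trajectory.
weights = [0.0, 0.0]
ema = list(weights)
for _ in range(5):
    weights = [w + 1.0 for w in weights]  # stand-in for an optimizer step
    ema = ema_update(ema, weights, tau=0.992)
```

With τ = 0.992 the average lags the raw weights considerably, which is the smoothing effect the paper studies.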
Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository or mention code being available in supplementary materials.
Open Datasets | Yes | We perform experiments on several image classification datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet (Le & Yang, 2015)) with various network architectures... We perform experiments on the benchmarks of CIFAR-10N and CIFAR-100N (Wei et al., 2022)
Dataset Splits | Yes | We define random 80/20 splits of the training set for train and validation respectively and perform hyperparameter optimization on the validation set, including the early stopping epoch for EMA (without BN stats recomputation). Finally, we train on the full training data using the selected hyperparameters and evaluate on the test set.
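The quoted protocol (random 80/20 split for tuning, then retraining on the full training set) can be sketched as follows; the index counts and function name are illustrative, not the authors' code:

```python
# Sketch of the quoted protocol: a random 80/20 train/validation split for
# hyperparameter tuning, after which the model is retrained on the full
# training set with the selected hyperparameters.
import random

def split_80_20(n_examples, seed=0):
    """Shuffle example indices and cut at 80% train / 20% validation."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * n_examples)
    return idx[:cut], idx[cut:]

train_idx, val_idx = split_80_20(50_000)  # e.g. CIFAR-10's training set size
```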
Hardware Specification | No | The paper mentions the memory requirements for models like ResNet-50 (23.7M parameters) but does not provide specific details about the hardware (e.g., GPU models, CPU types) used to conduct the experiments.
Software Dependencies | No | The paper specifies that SGD with Nesterov momentum was used as the optimizer and mentions other training parameters, but it does not provide specific software libraries or platforms with version numbers (e.g., PyTorch 1.x, Python 3.x).
Experiment Setup | Yes | Our experimental setup follows these steps for hyperparameter tuning: ... We fix the number of training epochs, batch size and weight decay. As for the EMA, we search for the best decay by keeping 5 parallel EMAs with τ ∈ [0.968, 0.984, 0.992, 0.996, 0.998]. We warm up the EMA decay in the first steps as min(α, (t+1)/(t+10)). EMA sampling every T = 16 steps (note that this affects the effective decay, see below). In Table 22 we include a summary of the hyperparameter configuration. ... Table 22: Setting: Value; Optimizer: SGD with Nesterov momentum; Momentum: 0.9; Learning rate: Tuned on validation set; Early stopping epochs: Tuned on validation set; Weight decay: ResNet: 1e-4, Wide ResNet / VGG-16: 5e-4; Batch size: 128; Epochs: CIFAR-10/100: 200, Tiny-ImageNet: 150; EMA decays: [0.968, 0.984, 0.992, 0.996, 0.998]; EMA sampling period: T = 16.
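The EMA bookkeeping quoted above (a warmed-up decay min(α, (t+1)/(t+10)), five parallel EMAs with different target decays, and updates applied only every T = 16 steps) can be sketched as below; function and variable names are illustrative, not the authors' code:

```python
# Sketch of the EMA bookkeeping in the quoted setup: a decay warmup
# min(alpha, (t+1)/(t+10)), five parallel EMAs with different target decays,
# and EMA updates applied only once every T optimizer steps.
# All names are illustrative, not the authors' code.

TAUS = [0.968, 0.984, 0.992, 0.996, 0.998]  # candidate EMA decays
T = 16                                      # EMA sampling period (steps)

def warmed_decay(alpha, t):
    """Warm up the decay over the first EMA updates, ramping toward alpha."""
    return min(alpha, (t + 1) / (t + 10))

def maybe_update_emas(emas, weights, step):
    """Update each parallel EMA only on steps that are multiples of T."""
    if step % T != 0:
        return emas
    t = step // T  # number of EMA updates performed so far
    out = []
    for tau, ema in zip(TAUS, emas):
        d = warmed_decay(tau, t)
        out.append([d * e + (1 - d) * w for e, w in zip(ema, weights)])
    return out
```

At t = 0 the warmed decay is min(τ, 0.1) = 0.1, so the first EMA updates track the current weights closely. Since each EMA update spans T optimizer steps, the equivalent per-step decay is roughly τ^(1/T), which is presumably the effect the paper's "this affects the effective decay" remark refers to.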