Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits
Authors: Daniel Morales-Brotons, Thijs Vogels, Hadrien Hendrikx
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) prediction consistency, iii) calibration and iv) transfer learning. |
| Researcher Affiliation | Academia | Daniel Morales-Brotons EMAIL EPFL Thijs Vogels EPFL Hadrien Hendrikx EMAIL Centre Inria de l'Univ. Grenoble Alpes, CNRS, LJK, Grenoble, France |
| Pseudocode | No | The paper describes the Exponential Moving Average (EMA) update rule mathematically in Equation (1) and elaborates on various methodologies in prose, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository or mention code being available in supplementary materials. |
| Open Datasets | Yes | We perform experiments on several image classification datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet (Le & Yang, 2015)) with various network architectures... We perform experiments on the benchmarks of CIFAR-10N and CIFAR-100N (Wei et al., 2022) |
| Dataset Splits | Yes | We define random 80/20 splits of the training set for train and validation respectively and perform hyperparameter optimization on the validation set, including the early stopping epoch for EMA (without BN stats recomputation). Finally, we train on the full training data using the selected hyperparameters and evaluate on the test set. |
| Hardware Specification | No | The paper mentions the memory requirements for models like ResNet-50 (23.7M parameters) but does not provide specific details about the hardware (e.g., GPU models, CPU types) used to conduct the experiments. |
| Software Dependencies | No | The paper specifies that SGD with Nesterov momentum was used as the optimizer and mentions other training parameters, but it does not provide specific software libraries or platforms with version numbers (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | Our experimental setup follows these steps for hyperparameter tuning: ... We fix the number of training epochs, batch size and weight decay. As for the EMA, we search for the best decay by keeping 5 parallel EMAs with τ ∈ {0.968, 0.984, 0.992, 0.996, 0.998}. We warm up the EMA decay in the first steps as min(α, (t+1)/(t+10)). EMA sampling every T = 16 steps (note that this affects the effective decay, see below). In Table 22 we include a summary of the hyperparameter configuration. ... Table 22: Setting: Value; Optimizer: SGD with Nesterov momentum; Momentum: 0.9; Learning rate: Tuned on validation set; Early stopping epochs: Tuned on validation set; Weight Decay: ResNet: 1e-4. WideResNet, VGG-16: 5e-4; Batch size: 128; Epochs: CIFAR-10/100: 200. Tiny-ImageNet: 150; EMA decays: [0.968, 0.984, 0.992, 0.996, 0.998]; EMA sampling period: T = 16. |
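
The Experiment Setup row describes keeping 5 parallel EMAs (one per candidate decay), warming up the decay as min(α, (t+1)/(t+10)), and updating the averages only every T = 16 optimizer steps. Since the paper releases no code, the following is a minimal illustrative sketch of that procedure; all function and variable names are assumptions, and model weights are represented as flat lists of floats for simplicity.

```python
# Illustrative sketch (not the authors' implementation) of maintaining
# several parallel EMAs of model weights, as described in the setup above.

DECAYS = [0.968, 0.984, 0.992, 0.996, 0.998]  # candidate EMA decays (Table 22)
T = 16  # EMA sampling period; note this changes the effective decay


def warmed_decay(alpha: float, t: int) -> float:
    """Decay warmup from the paper: min(alpha, (t+1)/(t+10))."""
    return min(alpha, (t + 1) / (t + 10))


def update_emas(emas, weights, step):
    """Update each parallel EMA in place with the current model weights.

    emas    -- list of weight lists, one per decay in DECAYS
    weights -- current model parameters as a flat list of floats
    step    -- EMA update counter (incremented once every T optimizer steps)
    """
    for ema, alpha in zip(emas, DECAYS):
        a = warmed_decay(alpha, step)
        for i, w in enumerate(weights):
            ema[i] = a * ema[i] + (1.0 - a) * w
    return emas


def train_loop(weights, n_steps):
    """Toy loop showing the T-step sampling: EMAs update every T steps only."""
    emas = [list(weights) for _ in DECAYS]
    ema_step = 0
    for t in range(n_steps):
        # ... one optimizer step updating `weights` would happen here ...
        if t % T == 0:
            update_emas(emas, weights, ema_step)
            ema_step += 1
    return emas
```

In the paper's protocol, each of the five EMAs is then evaluated on the validation set and the best decay is kept; early in training the warmup keeps the effective decay small, so the average tracks the fast-moving weights closely before settling to α.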
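
The Dataset Splits row states that hyperparameters are tuned on random 80/20 train/validation splits of the training set before retraining on the full data. A minimal sketch of such a split is below; the paper does not specify its splitting code, so the seeding and interface here are assumptions.

```python
import random


def train_val_split(n_examples: int, val_frac: float = 0.2, seed: int = 0):
    """Random index split of a training set into train/validation subsets.

    Returns (train_indices, val_indices); with val_frac=0.2 this is the
    80/20 split described in the report. Illustrative only.
    """
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)  # seeded for a reproducible split
    n_val = int(n_examples * val_frac)
    return idx[n_val:], idx[:n_val]
```

After selecting hyperparameters (including the EMA early-stopping epoch) on the validation indices, the final model is trained on all indices and evaluated once on the test set.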