How to Scale Your EMA

Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau Cuadros, Russell Webb

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the EMA Scaling Rule in synthetic settings (Section 3.1) and real-world settings where the model EMA plays an increasingly significant role in optimization: i) where the model EMA is used during inference instead of the target model (Section 3.2); ii) pseudo-labeling, where the model EMA (teacher) follows the target model (student), and the student is optimized on a mixture of a) labeled data and b) data without labels, whose pseudo-labels are produced by the teacher (Section 3.3); and iii) self-supervised learning, which is the same as the semi-supervised case, except there is no labeled data (Section 3.4). (A sketch of the EMA Scaling Rule follows the table.)
Researcher Affiliation | Industry | {dbusbridge, jramapuram, p_ablin, antares, eeshan, xsuaucuadros, rwebb}@apple.com
Pseudocode | Yes | Algorithm 1: Stochastic Gradient Descent with Progressive Scaling. (A sketch follows the table.)
Open Source Code | No | The paper does not contain an unambiguous statement or a direct link to a source-code repository for the methodology described in the paper.
Open Datasets | Yes | ImageNet1k (Russakovsky et al., 2014), LibriSpeech (Panayotov et al., 2015), CIFAR10 (Krizhevsky et al., 2014)
Dataset Splits | Yes | The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all hyperparameters, as well as to select the best models.
Hardware Specification | Yes | All experiments are conducted using 80GB A100s.
Software Dependencies | No | The paper mentions using "JAX" and "PyTorch" for implementation but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We present the base hyperparameters for training BYOL with a ResNet-18 backbone using SGD in Table 9. This recipe was developed by starting from a well-known BYOL ResNet-50 recipe (Grill et al., 2020), adapting the input augmentations for CIFAR10, and performing a search over learning rate choices for an SGD optimizer. (A sketch of the teacher EMA update follows the table.)
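The central result being validated above is the EMA Scaling Rule: when the batch size of a base recipe is scaled by a factor kappa, the SGD learning rate is scaled linearly and the EMA momentum rho is raised to the power kappa. The snippet below is a minimal illustration of that rule together with the model-EMA update it governs; `scale_recipe` and `ema_update` are hypothetical helpers, and the base values are placeholders rather than a recipe from the paper.

```python
def scale_recipe(batch_size, lr, rho, kappa):
    """Scale an SGD + model-EMA recipe from batch size B to kappa * B."""
    return {
        "batch_size": int(round(kappa * batch_size)),
        "lr": kappa * lr,        # linear learning-rate scaling for SGD
        "rho": rho ** kappa,     # EMA Scaling Rule: rho_hat = rho ** kappa
    }


def ema_update(zeta, theta, rho):
    """One model-EMA (teacher) step tracking the target (student) parameter."""
    return rho * zeta + (1.0 - rho) * theta


base = {"batch_size": 256, "lr": 0.1, "rho": 0.999}  # illustrative values only
scaled = scale_recipe(**base, kappa=8)               # recipe for batch size 2048
print(scaled)  # {'batch_size': 2048, 'lr': 0.8, 'rho': 0.999**8 ≈ 0.992}
```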
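The Pseudocode row refers to Algorithm 1, SGD with Progressive Scaling, in which the scaling factor kappa changes over training and the learning rate and EMA momentum are re-derived from the base recipe at each step. The sketch below follows that reading but is not the paper's pseudocode: it uses a toy one-dimensional squared-error loss, a hand-written kappa schedule, and function names of my own choosing.

```python
def grad(theta, x):
    """Gradient of a toy squared-error loss 0.5 * (theta - x) ** 2."""
    return theta - x


def sgd_with_progressive_scaling(data, base_lr, base_rho, kappa_schedule, steps):
    theta = 0.0   # target (student) parameter
    zeta = 0.0    # model-EMA (teacher) parameter
    for t in range(steps):
        kappa = kappa_schedule(t)
        lr = kappa * base_lr        # linear scaling rule for SGD
        rho = base_rho ** kappa     # EMA Scaling Rule
        # Average the gradient over a batch whose size grows with kappa.
        batch = data[t % len(data):][: max(1, int(kappa))]
        g = sum(grad(theta, x) for x in batch) / len(batch)
        theta = theta - lr * g                    # SGD step on the student
        zeta = rho * zeta + (1.0 - rho) * theta   # EMA step for the teacher
    return theta, zeta


theta, zeta = sgd_with_progressive_scaling(
    data=[0.5] * 1024,
    base_lr=0.1,
    base_rho=0.99,
    kappa_schedule=lambda t: 1 if t < 50 else 8,  # switch to 8x batches mid-run
    steps=100,
)
```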
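In the BYOL experiment setup, the teacher is a model EMA of the student (online) network. Below is a minimal PyTorch-flavoured sketch of that teacher update, assuming torchvision's resnet18 as a stand-in for the ResNet-18 backbone; the learning rate, momentum, and kappa values are placeholders and do not reproduce Table 9.

```python
import copy

import torch
from torchvision.models import resnet18

online = resnet18(num_classes=128)   # online (student) network
target = copy.deepcopy(online)       # EMA (teacher) network
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.SGD(online.parameters(), lr=0.1, momentum=0.9)


@torch.no_grad()
def ema_update(target_net, online_net, rho):
    """zeta <- rho * zeta + (1 - rho) * theta, applied parameter-wise."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(rho).add_(p_o, alpha=1.0 - rho)


# When the batch size is scaled by kappa, the EMA Scaling Rule replaces the
# base momentum rho_B with rho_B ** kappa (and the SGD learning rate with
# kappa times its base value); the update itself is unchanged.
kappa, rho_base = 4, 0.996
rho = rho_base ** kappa
# Inside the training loop, after optimizer.step():
#     ema_update(target, online, rho)
```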