How to Scale Your EMA

Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau Cuadros, Russell Webb

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the EMA Scaling Rule in synthetic settings (Section 3.1) and real-world settings where the model EMA plays an increasingly significant role in optimization: i) where the model EMA is used during inference instead of the target model (Section 3.2); ii) pseudo-labeling, where the model EMA (teacher) follows the target model (student), and the student is optimized on a mixture of a) labeled data and b) data without labels, whose pseudo-labels are produced by the teacher (Section 3.3); and iii) self-supervised learning, which is the same as the semi-supervised case, except there is no labeled data (Section 3.4). (A sketch of the EMA Scaling Rule follows the table.)
Researcher Affiliation | Industry | {dbusbridge, jramapuram, p_ablin, antares, eeshan, xsuaucuadros, rwebb}@apple.com
Pseudocode | Yes | Algorithm 1: Stochastic Gradient Descent with Progressive Scaling. (A sketch follows the table.)
Open Source Code | No | The paper does not contain an unambiguous statement or a direct link to a source-code repository for the methodology described in the paper.
Open Datasets | Yes | ImageNet1k (Russakovsky et al., 2014), LibriSpeech (Panayotov et al., 2015), CIFAR10 (Krizhevsky et al., 2014)
Dataset Splits | Yes | The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all hyperparameters, as well as to select the best models.
Hardware Specification | Yes | All experiments are conducted using 80GB A100s.
Software Dependencies | No | The paper mentions using "JAX" and "PyTorch" for implementation but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We present the base hyperparameters for training BYOL with a ResNet-18 backbone using SGD in Table 9. This recipe was developed by starting from a well-known BYOL ResNet-50 recipe (Grill et al., 2020), adapting the input augmentations for CIFAR10, and performing a search over learning rate choices for an SGD optimizer. (A sketch of the teacher EMA update follows the table.)
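The central result being validated above is the EMA Scaling Rule: when the batch size of a base recipe is scaled by a factor kappa, the SGD learning rate is scaled linearly and the EMA momentum rho is raised to the power kappa. The snippet below is a minimal illustration of that rule together with the model-EMA update it governs; `scale_recipe` and `ema_update` are hypothetical helpers, and the base values are placeholders rather than a recipe from the paper.

```python
def scale_recipe(batch_size, lr, rho, kappa):
    """Scale an SGD + model-EMA recipe from batch size B to kappa * B."""
    return {
        "batch_size": int(round(kappa * batch_size)),
        "lr": kappa * lr,        # linear learning-rate scaling for SGD
        "rho": rho ** kappa,     # EMA Scaling Rule: rho_hat = rho ** kappa
    }


def ema_update(zeta, theta, rho):
    """One model-EMA (teacher) step tracking the target (student) parameter."""
    return rho * zeta + (1.0 - rho) * theta


base = {"batch_size": 256, "lr": 0.1, "rho": 0.999}  # illustrative values only
scaled = scale_recipe(**base, kappa=8)               # recipe for batch size 2048
print(scaled)  # {'batch_size': 2048, 'lr': 0.8, 'rho': 0.999**8 ≈ 0.992}
```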
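The Pseudocode row refers to Algorithm 1, SGD with Progressive Scaling, in which the scaling factor kappa changes over training and the learning rate and EMA momentum are re-derived from the base recipe at each step. The sketch below follows that reading but is not the paper's pseudocode: it uses a toy one-dimensional squared-error loss, a hand-written kappa schedule, and function names of my own choosing.

```python
def grad(theta, x):
    """Gradient of a toy squared-error loss 0.5 * (theta - x) ** 2."""
    return theta - x


def sgd_with_progressive_scaling(data, base_lr, base_rho, kappa_schedule, steps):
    theta = 0.0   # target (student) parameter
    zeta = 0.0    # model-EMA (teacher) parameter
    for t in range(steps):
        kappa = kappa_schedule(t)
        lr = kappa * base_lr        # linear scaling rule for SGD
        rho = base_rho ** kappa     # EMA Scaling Rule
        # Average the gradient over a batch whose size grows with kappa.
        batch = data[t % len(data):][: max(1, int(kappa))]
        g = sum(grad(theta, x) for x in batch) / len(batch)
        theta = theta - lr * g                    # SGD step on the student
        zeta = rho * zeta + (1.0 - rho) * theta   # EMA step for the teacher
    return theta, zeta


theta, zeta = sgd_with_progressive_scaling(
    data=[0.5] * 1024,
    base_lr=0.1,
    base_rho=0.99,
    kappa_schedule=lambda t: 1 if t < 50 else 8,  # switch to 8x batches mid-run
    steps=100,
)
```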
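In the BYOL experiment setup, the teacher is a model EMA of the student (online) network. Below is a minimal PyTorch-flavoured sketch of that teacher update, assuming torchvision's resnet18 as a stand-in for the ResNet-18 backbone; the learning rate, momentum, and kappa values are placeholders and do not reproduce Table 9.

```python
import copy

import torch
from torchvision.models import resnet18

online = resnet18(num_classes=128)   # online (student) network
target = copy.deepcopy(online)       # EMA (teacher) network
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.SGD(online.parameters(), lr=0.1, momentum=0.9)


@torch.no_grad()
def ema_update(target_net, online_net, rho):
    """zeta <- rho * zeta + (1 - rho) * theta, applied parameter-wise."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(rho).add_(p_o, alpha=1.0 - rho)


# When the batch size is scaled by kappa, the EMA Scaling Rule replaces the
# base momentum rho_B with rho_B ** kappa (and the SGD learning rate with
# kappa times its base value); the update itself is unchanged.
kappa, rho_base = 4, 0.996
rho = rho_base ** kappa
# Inside the training loop, after optimizer.step():
#     ema_update(target, online, rho)
```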