Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
How to Scale Your EMA
Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau Cuadros, Russell Webb
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the EMA Scaling Rule in synthetic settings (Section 3.1) and real-world settings where the model EMA plays an increasingly significant role in optimization: i) where the model EMA is used during inference instead of the target model (Section 3.2); ii) pseudo-labeling, where the model EMA (teacher) follows the target model (student), and the student is optimized on a mixture of a) labeled data and b) data without labels, whose pseudo-labels are produced by the teacher (Section 3.3); and iii) self-supervised learning, which is the same as the semi-supervised case, except there is no labeled data (Section 3.4). |
| Researcher Affiliation | Industry | |
| Pseudocode | Yes | Algorithm 1 Stochastic Gradient Descent with Progressive Scaling |
| Open Source Code | No | The paper does not contain an unambiguous statement or a direct link to a source-code repository for the methodology described in the paper. |
| Open Datasets | Yes | ImageNet1k (Russakovsky et al., 2014), LibriSpeech (Panayotov et al., 2015), CIFAR10 (Krizhevsky et al., 2014) |
| Dataset Splits | Yes | The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all hyperparameters, as well as to select the best models. |
| Hardware Specification | Yes | All experiments conducted are using 80Gb A100s. |
| Software Dependencies | No | The paper mentions using "Jax" and "PyTorch" for implementation but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We present the base hyperparameters for training BYOL with a ResNet-18 backbone using SGD in Table 9. This recipe was developed by starting from a well-known BYOL ResNet-50 recipe (Grill et al., 2020), adapting the input augmentations for CIFAR10, and performing a search over learning rate choices for an SGD optimizer. |
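
The Research Type row refers to a model EMA (teacher) that follows a target model (student). As a rough illustration only, the sketch below shows a plain EMA update in Python. This page does not state the EMA Scaling Rule itself, so the form used here (raising the momentum to the power of the batch-size scaling factor, written `kappa`) is an assumption about the rule, and the names `ema_update`, `scale_ema_momentum`, `rho`, and `kappa` are hypothetical. It is not the authors' implementation.

```python
def ema_update(ema_params, model_params, momentum):
    """One EMA step per parameter: ema <- momentum * ema + (1 - momentum) * model."""
    return [momentum * e + (1.0 - momentum) * p
            for e, p in zip(ema_params, model_params)]


def scale_ema_momentum(base_momentum, kappa):
    """Assumed form of the scaling rule: momentum rho becomes rho ** kappa
    when the batch size is multiplied by kappa."""
    return base_momentum ** kappa


# Toy check with a frozen student: kappa small-batch EMA steps should match
# a single large-batch EMA step that uses the scaled momentum.
teacher = [0.0, 1.0]    # EMA (teacher) parameters, illustrative values
student = [1.0, -1.0]   # target (student) parameters, held fixed here
rho, kappa = 0.99, 4

small_batch = teacher
for _ in range(kappa):
    small_batch = ema_update(small_batch, student, rho)

large_batch = ema_update(teacher, student, scale_ema_momentum(rho, kappa))
max_gap = max(abs(a - b) for a, b in zip(small_batch, large_batch))
```

With the student held fixed, the two trajectories agree up to floating-point error, which is the intuition behind scaling the momentum this way. In real training the student changes at every step, so the match is only approximate.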