How to Scale Your EMA
Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau Cuadros, Russell Webb
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the EMA Scaling Rule in synthetic settings (Section 3.1) and real-world settings where the model EMA plays an increasingly significant role in optimization: i) where the model EMA is used during inference instead of the target model (Section 3.2); ii) pseudo-labeling, where the model EMA (teacher) follows the target model (student), and the student is optimized on a mixture of a) labeled data and b) unlabeled data whose pseudo-labels are produced by the teacher (Section 3.3); and iii) self-supervised learning, which is the same as the semi-supervised case except there is no labeled data (Section 3.4). (The scaling rule itself is sketched in code after the table.) |
| Researcher Affiliation | Industry | {dbusbridge, jramapuram, p_ablin, antares, eeshan, xsuaucuadros, rwebb}@apple.com |
| Pseudocode | Yes | Algorithm 1: Stochastic Gradient Descent with Progressive Scaling (a runnable sketch of such a loop follows the table). |
| Open Source Code | No | The paper contains neither an unambiguous statement of a code release nor a direct link to a source-code repository for the described methodology. |
| Open Datasets | Yes | ImageNet1k (Russakovsky et al., 2014), LibriSpeech (Panayotov et al., 2015), CIFAR10 (Krizhevsky et al., 2014) |
| Dataset Splits | Yes | The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all hyperparameters, as well as to select the best models. |
| Hardware Specification | Yes | All experiments are conducted using 80GB A100s. |
| Software Dependencies | No | The paper mentions using "JAX" and "PyTorch" for implementation but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We present the base hyperparameters for training BYOL with a ResNet-18 backbone using SGD in Table 9. This recipe was developed by starting from a well-known BYOL ResNet-50 recipe (Grill et al., 2020), adapting the input augmentations for CIFAR10, and performing a search over learning rate choices for an SGD optimizer. |
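
The paper's central result is the EMA Scaling Rule: when the batch size is scaled by a factor κ (with the learning rate scaled accordingly), the EMA momentum should be scaled as ρ̂ = ρ^κ. Below is a minimal PyTorch sketch of the rule and the Polyak-style EMA update it governs; `ema_momentum` and `ema_update` are hypothetical helper names for illustration, not identifiers from the paper's (unreleased) code.

```python
import copy
import torch

def ema_momentum(rho_base: float, kappa: float) -> float:
    """EMA Scaling Rule: at a batch size scaled by kappa, use rho ** kappa."""
    return rho_base ** kappa

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, rho: float) -> None:
    """One Polyak/EMA step: ema <- rho * ema + (1 - rho) * model."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(rho).add_(p, alpha=1.0 - rho)

# Example: a recipe tuned at batch size 256 with rho = 0.9999,
# scaled up to batch size 4096 (kappa = 16).
model = torch.nn.Linear(8, 2)
ema_model = copy.deepcopy(model)
rho_hat = ema_momentum(0.9999, kappa=4096 / 256)  # ~0.9984
ema_update(ema_model, model, rho_hat)
```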
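
Algorithm 1 (progressive scaling) changes the batch size during training rather than fixing it at initialization, rescaling the optimizer hyperparameters at each transition. The sketch below assumes SGD under the linear learning-rate scaling rule (η → κη) together with the EMA Scaling Rule above; `data_source`, `schedule` (a step → κ mapping), and the base constants are hypothetical stand-ins, not the paper's interface, and the instantaneous switch shown here is the simplest variant of the transition the paper describes.

```python
import torch

BASE_BATCH, BASE_LR, BASE_RHO = 256, 0.1, 0.9999

def scaled_hparams(kappa: float) -> tuple[float, float]:
    """Linear LR scaling rule for SGD plus the EMA Scaling Rule."""
    return BASE_LR * kappa, BASE_RHO ** kappa

def train(model, ema_model, data_source, schedule, total_steps):
    """SGD with progressive scaling: `schedule` maps step -> new kappa."""
    kappa = 1.0
    lr, rho = scaled_hparams(kappa)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for step in range(total_steps):
        if step in schedule:                        # batch-size transition
            kappa = schedule[step]
            lr, rho = scaled_hparams(kappa)
            for group in opt.param_groups:          # rescale the learning rate
                group["lr"] = lr
        x, y = data_source(int(kappa * BASE_BATCH))  # draw a kappa-scaled batch
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        ema_update(ema_model, model, rho)           # helper from the sketch above
```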