MoMo: Momentum Models for Adaptive Learning Rates
Authors: Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and Imagenet, for recommender systems on Criteo, for a transformer model on the translation task IWSLT14, and for a diffusion model. |
| Researcher Affiliation | Collaboration | (1) Department of Mathematics, Technical University of Munich, Munich; (2) Flatiron Institute, CCM, New York; (3) Meta AI, Fundamental AI Research (FAIR) team, New York. |
| Pseudocode | Yes | Algorithm 1 MoMo: Model-based Momentum method; Algorithm 2 MoMo-Adam: Adaptive learning rates for Adam; Algorithm 3 Reset Star; Algorithm 4 Estimate Star; Algorithm 5 MoMo-Bias: Model-based Momentum with bias correction; Algorithm 6 MoMo⋆: Adaptive learning rates and online estimation of f⋆. An illustrative sketch of the MoMo-style step appears after this table. |
| Open Source Code | No | An implementation of MoMo is available in PyTorch and optax. (Introduction). This statement is ambiguous: it is unclear whether the authors release the specific code used in this paper or whether MoMo is merely integrated into those libraries, and no direct link to a repository is given. |
| Open Datasets | Yes | ResNet20 for CIFAR10 and ResNet110 for CIFAR100; DLRM for Criteo (Tien & Chapelle, 2014); IWSLT14 (Ott et al., 2019); UNet for Smithsonian Butterflies; ViT for Imagenet-1k (Dosovitskiy et al., 2021). |
| Dataset Splits | No | Figure 2 shows the final training loss and validation set accuracy... (Section 6.1.1). The paper reports results on validation sets but does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages or exact counts). |
| Hardware Specification | Yes | Unless specified otherwise, we train on a single NVIDIA A100 GPU. ViT for Imagenet-1k: 10 h (on four NVIDIA A100); ResNet18 for Imagenet32: 20 h (on an NVIDIA V100). |
| Software Dependencies | No | An implementation of MoMo is available in PyTorch and optax. (Introduction). Model references: pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html; https://github.com/facebookresearch/fairseq; https://huggingface.co/docs/diffusers/main/en/api/models/unet2d; timm/models/vision_transformer.py. The paper mentions several software libraries and frameworks (PyTorch, optax, fairseq, Hugging Face, timm) but does not provide version numbers for these dependencies. |
| Experiment Setup | Yes | We use default choices for momentum parameter β = 0.9 for MoMo and SGD-M, and (β1, β2) = (0.9, 0.999) for MoMo-Adam and Adam respectively. In the experiments of this section, we always report averaged values over three seeds (five for DLRM), and do not use weight decay (λ = 0). We run 50 epochs for ResNet20 and 100 epochs for ResNet110, both with batch size 128. A configuration sketch with these settings follows the table. |
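To make the pseudocode row concrete, below is a minimal, illustrative sketch of a MoMo-style step: exponential moving averages of the gradients, the mini-batch loss values, and the gradient–iterate inner products define a model whose Polyak-type term caps the user-supplied learning rate. The class name `MomoSketch`, the single parameter group, the default choice f⋆ = 0, and the omission of bias correction and the exact averaging weights of Algorithms 1–6 are simplifying assumptions; this is not the authors' reference implementation.

```python
import torch


class MomoSketch(torch.optim.Optimizer):
    """Illustrative sketch of a MoMo-style step (not the reference implementation).

    Maintains EMAs of gradients (d), loss values (f_bar) and <g_k, x_k> (gamma),
    then caps the learning rate with a Polyak-type term built from that model.
    """

    def __init__(self, params, lr=1.0, beta=0.9, f_star=0.0):
        defaults = dict(lr=lr, beta=beta, f_star=f_star)
        super().__init__(params, defaults)
        self.f_bar = 0.0   # EMA of mini-batch loss values
        self.gamma = 0.0   # EMA of <g_k, x_k>

    @torch.no_grad()
    def step(self, loss):
        """`loss` is the current mini-batch loss value (float or 0-d tensor)."""
        group = self.param_groups[0]
        beta, lr, f_star = group["beta"], group["lr"], group["f_star"]

        # Update the per-parameter gradient EMAs and the scalar inner products.
        dot_gx, dot_dx, d_norm_sq = 0.0, 0.0, 0.0
        for p in group["params"]:
            if p.grad is None:
                continue
            state = self.state[p]
            if "d" not in state:
                state["d"] = torch.zeros_like(p)
            state["d"].mul_(beta).add_(p.grad, alpha=1 - beta)
            dot_gx += torch.sum(p.grad * p).item()
            dot_dx += torch.sum(state["d"] * p).item()
            d_norm_sq += torch.sum(state["d"] ** 2).item()

        self.f_bar = beta * self.f_bar + (1 - beta) * float(loss)
        self.gamma = beta * self.gamma + (1 - beta) * dot_gx

        # Polyak-type cap on the step size derived from the momentum model.
        model_gap = max(self.f_bar - self.gamma + dot_dx - f_star, 0.0)
        tau = min(lr, model_gap / (d_norm_sq + 1e-12))

        for p in group["params"]:
            if p.grad is None:
                continue
            p.add_(self.state[p]["d"], alpha=-tau)
```

In a training loop, one would call `loss.backward()` and then `opt.step(loss.item())` each iteration, since the adaptive step size needs the current loss value.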
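The hyperparameters quoted in the Experiment Setup row map directly onto standard PyTorch configurations for the SGD-M and Adam baselines; a minimal sketch is below. The torchvision `resnet18` stands in for ResNet20/ResNet110 (which are not part of torchvision), and the learning rates and data transforms are placeholders rather than the paper's swept values.

```python
import torch
import torchvision
from torch.utils.data import DataLoader

# Baseline optimizer settings quoted in the Experiment Setup row:
# SGD-M with beta = 0.9, Adam with (beta1, beta2) = (0.9, 0.999), no weight decay.
model = torchvision.models.resnet18(num_classes=10)

sgd_m = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0)

# CIFAR10 loader with the quoted batch size of 128; the transform is a placeholder,
# not the paper's exact augmentation pipeline.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=torchvision.transforms.ToTensor(),
)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# 50 epochs for ResNet20 (100 for ResNet110) in the paper; illustrative loop:
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(50):
    for x, y in train_loader:
        sgd_m.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        sgd_m.step()
```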