MoMo: Momentum Models for Adaptive Learning Rates
Authors: Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and Imagenet, for recommender systems on Criteo, for a transformer model on the translation task IWSLT14, and for a diffusion model. |
| Researcher Affiliation | Collaboration | (1) Department of Mathematics, Technical University of Munich, Munich; (2) Flatiron Institute, CCM, New York; (3) Meta AI, Fundamental AI Research (FAIR) team, New York. |
| Pseudocode | Yes | Algorithm 1 MoMo: Model-based Momentum method; Algorithm 2 MoMo-Adam: Adaptive learning rates for Adam; Algorithm 3 Reset Star; Algorithm 4 Estimate Star; Algorithm 5 MoMo-Bias: Model-based Momentum with bias correction; Algorithm 6 MoMo⋆: Adaptive learning rates and online estimation of f⋆. An illustrative sketch of the MoMo-style step appears after this table. |
| Open Source Code | No | An implementation of MoMo is available in PyTorch and optax. (Introduction). This statement is ambiguous: it is unclear whether the authors release the specific code used in this paper or whether MoMo is merely integrated into those libraries, and no direct link to a repository is given. |
| Open Datasets | Yes | ResNet20 for CIFAR10 and ResNet110 for CIFAR100; DLRM for Criteo (Tien & Chapelle, 2014); IWSLT14 (Ott et al., 2019); UNet for Smithsonian Butterflies; ViT for Imagenet-1k (Dosovitskiy et al., 2021). |
| Dataset Splits | No | Figure 2 shows the final training loss and validation set accuracy... (Section 6.1.1). The paper reports results on validation sets but does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages or exact counts). |
| Hardware Specification | Yes | Unless specified otherwise, we train on a single NVIDIA A100 GPU. ViT for Imagenet-1k: 10 h (on four NVIDIA A100); ResNet18 for Imagenet32: 20 h (on an NVIDIA V100). |
| Software Dependencies | No | An implementation of MoMo is available in PyTorch and optax. (Introduction). Model references: pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html; https://github.com/facebookresearch/fairseq; https://huggingface.co/docs/diffusers/main/en/api/models/unet2d; timm/models/vision_transformer.py. The paper mentions several software libraries and frameworks (PyTorch, optax, fairseq, Hugging Face, timm) but does not provide version numbers for these dependencies. |
| Experiment Setup | Yes | We use default choices for momentum parameter β = 0.9 for MoMo and SGD-M, and (β1, β2) = (0.9, 0.999) for MoMo-Adam and Adam respectively. In the experiments of this section, we always report averaged values over three seeds (five for DLRM), and do not use weight decay (λ = 0). We run 50 epochs for ResNet20 and 100 epochs for ResNet110, both with batch size 128. A configuration sketch with these settings follows the table. |
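To make the pseudocode row concrete, below is a minimal, illustrative sketch of a MoMo-style step: exponential moving averages of the gradients, the mini-batch loss values, and the gradient–iterate inner products define a model whose Polyak-type term caps the user-supplied learning rate. The class name `MomoSketch`, the single parameter group, the default choice f⋆ = 0, and the omission of bias correction and the exact averaging weights of Algorithms 1–6 are simplifying assumptions; this is not the authors' reference implementation.

```python
import torch


class MomoSketch(torch.optim.Optimizer):
    """Illustrative sketch of a MoMo-style step (not the reference implementation).

    Maintains EMAs of gradients (d), loss values (f_bar) and <g_k, x_k> (gamma),
    then caps the learning rate with a Polyak-type term built from that model.
    """

    def __init__(self, params, lr=1.0, beta=0.9, f_star=0.0):
        defaults = dict(lr=lr, beta=beta, f_star=f_star)
        super().__init__(params, defaults)
        self.f_bar = 0.0   # EMA of mini-batch loss values
        self.gamma = 0.0   # EMA of <g_k, x_k>

    @torch.no_grad()
    def step(self, loss):
        """`loss` is the current mini-batch loss value (float or 0-d tensor)."""
        group = self.param_groups[0]
        beta, lr, f_star = group["beta"], group["lr"], group["f_star"]

        # Update the per-parameter gradient EMAs and the scalar inner products.
        dot_gx, dot_dx, d_norm_sq = 0.0, 0.0, 0.0
        for p in group["params"]:
            if p.grad is None:
                continue
            state = self.state[p]
            if "d" not in state:
                state["d"] = torch.zeros_like(p)
            state["d"].mul_(beta).add_(p.grad, alpha=1 - beta)
            dot_gx += torch.sum(p.grad * p).item()
            dot_dx += torch.sum(state["d"] * p).item()
            d_norm_sq += torch.sum(state["d"] ** 2).item()

        self.f_bar = beta * self.f_bar + (1 - beta) * float(loss)
        self.gamma = beta * self.gamma + (1 - beta) * dot_gx

        # Polyak-type cap on the step size derived from the momentum model.
        model_gap = max(self.f_bar - self.gamma + dot_dx - f_star, 0.0)
        tau = min(lr, model_gap / (d_norm_sq + 1e-12))

        for p in group["params"]:
            if p.grad is None:
                continue
            p.add_(self.state[p]["d"], alpha=-tau)
```

In a training loop, one would call `loss.backward()` and then `opt.step(loss.item())` each iteration, since the adaptive step size needs the current loss value.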
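The hyperparameters quoted in the Experiment Setup row map directly onto standard PyTorch configurations for the SGD-M and Adam baselines; a minimal sketch is below. The torchvision `resnet18` stands in for ResNet20/ResNet110 (which are not part of torchvision), and the learning rates and data transforms are placeholders rather than the paper's swept values.

```python
import torch
import torchvision
from torch.utils.data import DataLoader

# Baseline optimizer settings quoted in the Experiment Setup row:
# SGD-M with beta = 0.9, Adam with (beta1, beta2) = (0.9, 0.999), no weight decay.
model = torchvision.models.resnet18(num_classes=10)

sgd_m = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0)

# CIFAR10 loader with the quoted batch size of 128; the transform is a placeholder,
# not the paper's exact augmentation pipeline.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=torchvision.transforms.ToTensor(),
)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# 50 epochs for ResNet20 (100 for ResNet110) in the paper; illustrative loop:
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(50):
    for x, y in train_loader:
        sgd_m.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        sgd_m.step()
```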