On the Implicit Bias of Adam

Authors: Matias D. Cattaneo, Jason Matthew Klusowski, Boris Shigida

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We also conduct numerical experiments and discuss how the proven facts can influence generalization." and "We provide numerical evidence consistent with our theoretical results by training various vision models on CIFAR10 using full-batch Adam."
Researcher Affiliation | Academia | "Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, USA. Correspondence to: Boris Shigida <bs1624@princeton.edu>."
Pseudocode | No | The paper provides mathematical definitions of algorithms and ODEs but does not include explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | "The code used for training the models is available at https://github.com/borshigida/implicit-bias-of-adam."
Open Datasets | Yes | "We train Resnet-50, CNNs and Vision Transformers (Dosovitskiy et al., 2020) on the CIFAR-10 dataset with full-batch Adam."
Dataset Splits | No | The paper mentions training on CIFAR-10 and evaluating test accuracy, but it does not explicitly describe train/validation/test dataset splits by percentages, counts, or reference to a standard split.
Hardware Specification | No | The paper mentions "Princeton Research Computing resources" but does not specify particular GPU/CPU models, processor types, or memory details.
Software Dependencies | No | The paper does not provide specific version numbers for any key software components or libraries used.
Experiment Setup | Yes | "Definition 1.1. The Adam algorithm (Kingma & Ba, 2015) is an optimization algorithm with numerical stability hyperparameter ε > 0, squared gradient momentum hyperparameter ρ ∈ (0, 1), gradient momentum hyperparameter β ∈ (0, 1), initialization θ(0) ∈ ℝ^p, ν(0) = 0 ∈ ℝ^p, m(0) = 0 ∈ ℝ^p, and the following update rule: for each n ≥ 0, j ∈ {1, . . . , p} ..." Figures 4 and 5 also specify experimental hyperparameters such as ε = 10^-8, β = 0.99 and ρ = 0.999, ε = 10^-8.
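
To connect Definition 1.1 to the reported setup, the sketch below implements a single Adam step in the paper's notation (β for gradient momentum, ρ for squared-gradient momentum, ε for numerical stability) with the hyperparameter values quoted from Figures 4 and 5 (ε = 10^-8, β = 0.99, ρ = 0.999). The update rule itself is elided in the quoted definition, so the body follows the standard bias-corrected update of Kingma & Ba (2015); the learning rate lr and the full-batch gradient oracle grad_fn are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

def adam_step(theta, m, v, grad, n, lr=1e-3, beta=0.99, rho=0.999, eps=1e-8):
    """One Adam update in the notation of Definition 1.1.

    beta: gradient momentum, rho: squared-gradient momentum,
    eps: numerical stability constant (values as in Figures 4 and 5).
    The exact update rule is elided in the quoted text; this follows the
    standard bias-corrected form of Kingma & Ba (2015).
    """
    m = beta * m + (1.0 - beta) * grad             # m(n+1): first-moment estimate
    v = rho * v + (1.0 - rho) * grad ** 2          # nu(n+1): second-moment estimate
    m_hat = m / (1.0 - beta ** (n + 1))            # bias-corrected first moment
    v_hat = v / (1.0 - rho ** (n + 1))             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise step
    return theta, m, v

# Full-batch usage sketch: grad_fn is a hypothetical oracle returning the
# gradient of the loss over the entire training set (no mini-batching),
# matching the paper's full-batch Adam experiments on CIFAR-10.
theta = np.zeros(10)                               # theta(0); dimension is illustrative
m, v = np.zeros_like(theta), np.zeros_like(theta)  # m(0) = nu(0) = 0, as in Definition 1.1
# for n in range(num_steps):
#     grad = grad_fn(theta)
#     theta, m, v = adam_step(theta, m, v, grad, n)
```

The bias-correction lines are included because they are part of the original Kingma & Ba algorithm; if the paper's Definition 1.1 omits them, the m_hat and v_hat lines can simply be dropped.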