On the Implicit Bias of Adam
Authors: Matias D. Cattaneo, Jason Matthew Klusowski, Boris Shigida
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We also conduct numerical experiments and discuss how the proven facts can influence generalization." and "We provide numerical evidence consistent with our theoretical results by training various vision models on CIFAR10 using full-batch Adam." |
| Researcher Affiliation | Academia | 1Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, USA. Correspondence to: Boris Shigida <bs1624@princeton.edu>. |
| Pseudocode | No | The paper provides mathematical definitions of algorithms and ODEs but does not include explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | The code used for training the models is available at https://github.com/borshigida/implicit-bias-of-adam. |
| Open Datasets | Yes | We train Resnet-50, CNNs and Vision Transformers (Dosovitskiy et al., 2020) on the CIFAR-10 dataset with full-batch Adam. |
| Dataset Splits | No | The paper mentions training on CIFAR-10 and evaluating test accuracy, but it does not explicitly describe train/validation/test dataset splits by percentages, counts, or by referring to a standard split. |
| Hardware Specification | No | The paper mentions 'Princeton Research Computing resources' but does not specify any particular GPU/CPU models, processor types, or memory details. |
| Software Dependencies | No | The paper does not provide specific version numbers for any key software components or libraries used. |
| Experiment Setup | Yes | Definition 1.1. The Adam algorithm (Kingma & Ba, 2015) is an optimization algorithm with numerical stability hyperparameter ε > 0, squared gradient momentum hyperparameter ρ ∈ (0, 1), gradient momentum hyperparameter β ∈ (0, 1), initialization θ(0) ∈ ℝ^p, ν(0) = 0 ∈ ℝ^p, m(0) = 0 ∈ ℝ^p and the following update rule: for each n ≥ 0, j ∈ {1, . . . , p} ... Figures 4 and 5 also specify experimental hyperparameters such as ε = 10^-8, β = 0.99 and ρ = 0.999 (a code sketch of this update appears below the table). |
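
For reference, here is a minimal NumPy sketch of the Adam update from Definition 1.1, using the hyperparameters reported for the figures (ε = 10^-8, β = 0.99, ρ = 0.999). The placement of ε outside the square root and the use of bias correction follow the common convention from Kingma & Ba (2015); the paper's exact Definition 1.1 may differ in these details.

```python
import numpy as np

def adam_step(theta, m, v, grad, n, lr=1e-3, beta=0.99, rho=0.999, eps=1e-8):
    """One Adam step (Kingma & Ba, 2015), in the paper's notation:
    beta = gradient momentum, rho = squared-gradient momentum,
    eps = numerical stability constant, n = 0-based iteration index."""
    m = beta * m + (1 - beta) * grad          # first-moment estimate m(n+1)
    v = rho * v + (1 - rho) * grad**2         # second-moment estimate v(n+1)
    m_hat = m / (1 - beta**(n + 1))           # bias-corrected first moment
    v_hat = v / (1 - rho**(n + 1))            # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise update
    return theta, m, v

# Usage on a toy quadratic objective f(theta) = 0.5 * ||theta||^2
theta = np.ones(5)
m = np.zeros_like(theta)                      # m(0) = 0
v = np.zeros_like(theta)                      # v(0) = 0
for n in range(100):
    grad = theta                              # gradient of the toy objective
    theta, m, v = adam_step(theta, m, v, grad, n)
```

The experiments train vision models on CIFAR-10 with full-batch Adam. Below is a hypothetical PyTorch sketch of full-batch training, in which the gradient is accumulated over the entire training set before each Adam step; the model choice, learning rate, and number of steps are assumptions for illustration, not the paper's settings (only β = 0.99, ρ = 0.999, and ε = 10^-8 are taken from the reported hyperparameters).

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=T.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=1024, shuffle=False)

# Model and learning rate are illustrative choices, not the paper's configuration.
model = torchvision.models.resnet50(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss(reduction="sum")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.99, 0.999), eps=1e-8)

for step in range(10):                        # number of full-batch steps is assumed
    optimizer.zero_grad()
    total_loss = 0.0
    for x, y in loader:                       # accumulate gradients over the full set
        x, y = x.to(device), y.to(device)
        loss = criterion(model(x), y) / len(train_set)
        loss.backward()                       # gradients sum across mini-batches
        total_loss += loss.item()
    optimizer.step()                          # one Adam step on the full-batch gradient
    print(f"step {step}: train loss {total_loss:.4f}")
```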