Convergence of Adam Under Relaxed Assumptions
Authors: Haochuan Li, Alexander Rakhlin, Ali Jadbabaie
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimation (Adam) algorithm for a wide class of optimization objectives. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of O(ϵ⁻³). |
| Researcher Affiliation | Academia | Haochuan Li MIT haochuan@mit.edu Alexander Rakhlin MIT rakhlin@mit.edu Ali Jadbabaie MIT jadbabai@mit.edu |
| Pseudocode | Yes | Algorithm 1 ADAM |
| Open Source Code | No | The paper mentions 'PyTorch implementation' as a default choice for λ, but does not provide a statement about releasing the authors' own code for the methodology or analysis described in this paper. |
| Open Datasets | Yes | Based on our preliminary experimental results on CIFAR-10 shown in Figure 1, the performance of Adam is not very sensitive to the choice of λ. |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, or detailed methodology) is provided for the CIFAR-10 dataset used in Figure 1. |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types, or memory amounts) are mentioned for the experiments, only general statements like 'training deep neural networks' and 'training transformers'. |
| Software Dependencies | No | The paper mentions 'PyTorch implementation' but does not specify any version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Figure 1: Test errors of different models trained on CIFAR-10 using the Adam optimizer with β = 0.9, β_sq = 0.999, η = 0.001 and different λs. |
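
For reference, the generalized smoothness assumption quoted in the Research Type row can be summarized as below. This is a sketch based only on the abstract wording; the constants L_0, L_ρ and the exponent ρ < 2 are illustrative labels and may not match the paper's exact notation.

```latex
% Sketch of the sub-quadratic generalized smoothness condition described in the
% abstract. L_0, L_rho, and rho are illustrative names; the paper's precise
% assumption and constants may differ.
\[
  \bigl\lVert \nabla^2 f(x) \bigr\rVert \;\le\;
  L_0 + L_\rho \,\bigl\lVert \nabla f(x) \bigr\rVert^{\rho},
  \qquad 0 \le \rho < 2 .
\]
```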
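
The Pseudocode row refers to Algorithm 1 (Adam). A minimal Python sketch of a single Adam update is given below for orientation; it follows the standard form of the algorithm, with λ denoting the offset added to the denominator (PyTorch's `eps`), and is not a transcription of the paper's Algorithm 1.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta=0.9, beta_sq=0.999, lam=1e-8):
    """One standard Adam update (sketch, not the paper's Algorithm 1 verbatim).
    `lam` is the offset added to the denominator (eps in torch.optim.Adam)."""
    m.mul_(beta).add_(grad, alpha=1 - beta)                   # first-moment estimate
    v.mul_(beta_sq).addcmul_(grad, grad, value=1 - beta_sq)   # second-moment estimate
    m_hat = m / (1 - beta ** t)                               # bias corrections
    v_hat = v / (1 - beta_sq ** t)
    param.add_(-lr * m_hat / (v_hat.sqrt() + lam))            # update with the lambda offset
    return param, m, v
```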
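
The Experiment Setup row specifies only the optimizer hyperparameters for Figure 1 (β = 0.9, β_sq = 0.999, η = 0.001, varying λ on CIFAR-10). A minimal sketch of how that configuration could be set up is shown below; the model architecture, batch size, λ grid, and training schedule are placeholders not stated in the paper, and λ is assumed to correspond to the `eps` argument of `torch.optim.Adam`.

```python
import torch
import torchvision
import torchvision.transforms as T

# CIFAR-10 is public; the split and preprocessing here are placeholders, not the paper's.
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=T.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
criterion = torch.nn.CrossEntropyLoss()

for lam in (1e-8, 1e-4, 1e-2):                           # hypothetical lambda grid
    model = torchvision.models.resnet18(num_classes=10)  # placeholder architecture
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), eps=lam)
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        break  # one illustrative step; the paper does not state the training schedule
```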