The Implicit Bias of Adam on Separable Data
Authors: Chenyang Zhang, Difan Zou, Yuan Cao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we study the implicit bias of Adam in linear logistic regression. Specifically, we show that when the training data are linearly separable, the iterates of Adam converge towards a linear classifier that achieves the maximum ℓ∞-margin in direction. Notably, for a general class of diminishing learning rates, this convergence occurs within polynomial time. Our results shed light on the difference between Adam and (stochastic) gradient descent from a theoretical perspective. ... In this section, we conduct numerical experiments to verify our theoretical conclusions. |
| Researcher Affiliation | Academia | Chenyang Zhang (Department of Statistics and Actuarial Science, School of Computing and Data Science, The University of Hong Kong; chyzhang@connect.hku.hk); Difan Zou (Department of Computer Science, School of Computing and Data Science & Institute of Data Science, The University of Hong Kong; dzou@cs.hku.hk); Yuan Cao (Department of Statistics and Actuarial Science, School of Computing and Data Science & Department of Mathematics, The University of Hong Kong; yuancao@hku.hk) |
| Pseudocode | No | The paper presents the Adam update rules as mathematical formulas (3.2)-(3.4) but does not include any clearly labeled 'Pseudocode' or 'Algorithm' block (a hedged sketch of the Adam step is given after this table). |
| Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: This paper focuses on theoretical analysis of the standard optimization algorithm Adam. While we include some simulation results, they are very simple and are irrelevant to the key theoretical contributions of this paper. |
| Open Datasets | No | We set the sample size n = 50, and dimension d = 50. Then the data set {(x_i, y_i)} is generated as follows: 1. x_i, i ∈ [n] are independently generated from N(0, I). 2. y_i, i ∈ [n] are independently generated as +1 or −1 with equal probability. |
| Dataset Splits | No | The paper describes synthetic data generation and experiment setup, but does not specify any training/validation splits. It only mentions 'training data are linearly separable' but not how data is partitioned for validation purposes. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. In the NeurIPS checklist, the authors state: 'Answer: [No] Justification: We only present very simple simulation results and computational resources are not the focus.' |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks) used for conducting the experiments. |
| Experiment Setup | Yes | We set the sample size n = 50, and dimension d = 50. ... for gradient descent with momentum, we set the momentum parameter as β1 = 0.9, and for Adam, we set β1 = 0.9, β2 = 0.99. All optimization algorithms are initialized with standard Gaussian distribution, and are run for 10^6 iterations. ... We run experiments on Adam with learning rates η_t = Θ(t^{-a}) for a ∈ {0.3, 0.5, 0.7, 1}... (a reproduction sketch of this setup follows the table) |
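
Since the 'Pseudocode' row notes that the paper gives the Adam update only as equations (3.2)-(3.4), the following minimal NumPy sketch shows a standard full-batch Adam step with the hyperparameters quoted in the experiment-setup row (β1 = 0.9, β2 = 0.99). Whether the paper's formulation includes bias correction or a stabilizing ε term is an assumption on our part, not something taken from the paper.

```python
import numpy as np

def adam_step(w, grad, m, v, lr, beta1=0.9, beta2=0.99, eps=1e-8):
    """One deterministic Adam step (sketch). The eps term and the absence of
    bias correction are assumptions; the paper's exact update is (3.2)-(3.4)."""
    m = beta1 * m + (1 - beta1) * grad        # exponential moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # exponential moving average of squared gradients
    w = w - lr * m / (np.sqrt(v) + eps)       # coordinate-wise adaptive update
    return w, m, v
```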
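
The experiment-setup and data-generation rows can be turned into a short reproduction script. The sketch below follows the quoted setup (n = d = 50, x_i ~ N(0, I), random ±1 labels, standard Gaussian initialization, β1 = 0.9, β2 = 0.99, diminishing learning rate η_t = Θ(t^{-a})); the learning-rate constant, the ε term, the iteration count shown, and the final margin diagnostic are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data as in the quoted setup: n = d = 50, x_i ~ N(0, I_d),
# y_i uniform on {+1, -1}. With n = d the data are linearly separable
# with high probability.
n, d = 50, 50
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def logistic_loss_grad(w):
    # Full-batch gradient of (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)).
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(np.clip(margins, -50.0, 50.0)))  # clip to avoid overflow
    return (X.T @ coef) / n

# Full-batch Adam with diminishing learning rate eta_t = t^(-a).
# The unit constant in Theta(t^(-a)), eps, and the shorter iteration count
# here are illustrative; the paper reports 10^6 iterations.
beta1, beta2, eps, a = 0.9, 0.99, 1e-8, 0.5
w = rng.standard_normal(d)                    # standard Gaussian initialization
m = np.zeros(d)
v = np.zeros(d)
for t in range(1, 100_001):
    g = logistic_loss_grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    w -= t ** (-a) * m / (np.sqrt(v) + eps)

# Normalized l_inf-margin of the learned linear classifier.
print(np.min(y * (X @ w)) / np.max(np.abs(w)))
```

Per the paper's main result, the printed normalized ℓ∞-margin should grow towards the maximum ℓ∞-margin of the data set as the number of iterations increases.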