Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization
Authors: Shuo Xie, Zhiyuan Li
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we run experiments to verify the theoretical claims. In Section 5.1, we show that the ℓ∞ norm of iterates by AdamW can converge below 1/λ as shown in Theorem 1.1, even when the function is non-convex. In Section 5.2, we show that steepest descent w.r.t. the ℓ∞ norm works better than w.r.t. the ℓ2 norm for a specific function, which has better properties under ℓ∞ geometry. |
| Researcher Affiliation | Academia | Toyota Technological Institute at Chicago, IL, the United States. Correspondence to: Shuo Xie <shuox@ttic.edu>, Zhiyuan Li <zhiyuanli@ttic.edu>. |
| Pseudocode | Yes | Algorithm 1: Adam with ℓ2 regularization and Adam with decoupled weight decay (AdamW) (a hedged sketch of the two update rules appears below this table) |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for their methodology is publicly available. |
| Open Datasets | Yes | We train a small two-layer transformer for a language modeling task on the Penn Treebank dataset (PTB) (Marcus et al., 1993) |
| Dataset Splits | No | The paper mentions training on the Penn Treebank dataset but does not explicitly detail the train/validation/test splits, their sizes, or how they were derived. |
| Hardware Specification | Yes | The experiments are run on a single A4000 or a single A6000. |
| Software Dependencies | No | The paper mentions using 'PyTorch' for implementation but does not specify its version number or any other software dependencies with version details. |
| Experiment Setup | Yes | We train the model in full batch without dropout in order to get deterministic gradients and follow the constant learning rate setting for a total of 12800 epochs. The learning rate η is 10^-3. For each setting of β1, β2, we use Adam and AdamW with weight decay coefficient λ = 1, 2 to compare the ℓ∞ norm of iterates in each optimizer. We employ the standard implementation in PyTorch but set ϵ to 10^-16 in Adam and AdamW rather than 0... Each run is repeated for 4 random seeds... |
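The Pseudocode row quotes the caption of Algorithm 1, which contrasts Adam with ℓ2 regularization against Adam with decoupled weight decay (AdamW). The paper's listing is not reproduced here; the following is a minimal sketch of the standard difference between the two updates, with the single-tensor interface and the `decoupled` flag as our own illustrative choices.

```python
import torch

def adam_like_step(param, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
                   betas=(0.9, 0.999), eps=1e-16, weight_decay=1.0,
                   decoupled=True):
    """One update step: decoupled=True mimics AdamW, False mimics Adam + l2 regularization."""
    beta1, beta2 = betas
    if decoupled:
        # AdamW: weight decay shrinks the parameter directly, bypassing the moment estimates.
        param.mul_(1 - lr * weight_decay)
    else:
        # Adam with l2 regularization: the decay term enters through the gradient.
        grad = grad + weight_decay * param
    # Exponential moving averages of the gradient and its square.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-corrected Adam step.
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    param.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)
```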
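The Experiment Setup row pins down most of the optimizer configuration. The sketch below assembles those reported values (constant learning rate 10^-3, ϵ = 10^-16, weight decay λ ∈ {1, 2}, full-batch training, 12800 epochs) into a runnable PyTorch loop; the tiny linear model and random data are placeholders for the two-layer transformer and the PTB pipeline, which the paper does not specify in runnable form.

```python
import torch

torch.manual_seed(0)                          # one of the 4 reported random seeds would go here
model = torch.nn.Linear(32, 32)               # placeholder for the two-layer transformer
data, target = torch.randn(256, 32), torch.randn(256, 32)  # placeholder "full batch"
loss_fn = torch.nn.MSELoss()

# Reported settings: lr = 1e-3 (constant), eps = 1e-16, weight_decay in {1, 2};
# the paper varies (beta1, beta2) per setting, so the defaults here are an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), eps=1e-16, weight_decay=1.0)

for epoch in range(12800):
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)       # full batch, no dropout -> deterministic gradient
    loss.backward()
    optimizer.step()
    # Track the l_inf norm of the iterate, which the paper compares against 1/lambda (Theorem 1.1).
    linf = max(p.detach().abs().max().item() for p in model.parameters())
```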