Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization
Authors: Shuo Xie, Zhiyuan Li
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we run experiments to verify the theoretical claims. In Section 5.1, we show that the ℓ∞ norm of iterates by AdamW can converge below 1/λ as shown in Theorem 1.1 even when the function is non-convex. In Section 5.2, we show that steepest descent w.r.t. ℓ∞ norm works better than w.r.t. ℓ2 norm for a specific function, which has better properties under ℓ∞ geometry. |
| Researcher Affiliation | Academia | Toyota Technological Institute at Chicago, IL, the United States. Correspondence to: Shuo Xie <EMAIL>, Zhiyuan Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adam with ℓ2 regularization and Adam with decoupled weight decay (AdamW) |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for their methodology is publicly available. |
| Open Datasets | Yes | We train a small two-layer transformer for language modeling task on the Penn Treebank dataset (PTB) (Marcus et al., 1993) |
| Dataset Splits | No | The paper mentions training on the Penn Treebank dataset but does not explicitly detail the train/validation/test splits, their sizes, or how they were derived. |
| Hardware Specification | Yes | The experiments are run on a single A4000 or a single A6000. |
| Software Dependencies | No | The paper mentions using 'PyTorch' for implementation but does not specify its version number or any other software dependencies with version details. |
| Experiment Setup | Yes | We train the model in full batch without dropout in order to get deterministic gradients and follow the constant learning rate setting for the total 12800 epochs. The learning rate η is 10^-3. For each setting of β1, β2, we use Adam and AdamW with weight decay coefficient λ = 1, 2 to compare the ℓ∞ norm for iterates in each optimizer. We employ the standard implementation in PyTorch but set ϵ to be 10^-16 in Adam and AdamW rather than 0... Each run is repeated for 4 random seeds... |
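
The setup row above can be illustrated with a small, dependency-free sketch. This is not the paper's code or its transformer experiment: it runs a hand-written full-batch AdamW update on a hypothetical toy quadratic and reports the ℓ∞ norm of the final iterate next to 1/λ. The learning rate (10^-3), ϵ (10^-16), step count (12800), and λ ∈ {1, 2} are taken from the table; the loss, the `targets` values, the function names, and β1 = 0.9, β2 = 0.999 are assumptions made only for illustration.

```python
import math


def run_adamw(targets, lam, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-16, steps=12800):
    """Full-batch AdamW on the toy loss f(x) = 0.5 * sum((x_i - targets_i)**2).

    The loss, targets, and beta values are illustrative assumptions;
    lr, eps, and steps mirror the values quoted in the table.
    """
    n = len(targets)
    x = [0.0] * n  # iterate, started at the origin
    m = [0.0] * n  # first-moment estimate
    v = [0.0] * n  # second-moment estimate
    for t in range(1, steps + 1):
        for i, target in enumerate(targets):
            g = x[i] - target  # exact (deterministic) gradient
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            # Decoupled weight decay: lam * x is added outside the
            # adaptive ratio instead of being folded into the gradient.
            x[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + lam * x[i])
    return x


def linf_norm(x):
    """Largest absolute coordinate of x."""
    return max(abs(xi) for xi in x)


if __name__ == "__main__":
    targets = [5.0, -3.0, 0.1]  # unconstrained minimizer has l_inf norm 5
    for lam in (1.0, 2.0):
        x = run_adamw(targets, lam)
        print(f"lambda={lam}: l_inf norm={linf_norm(x):.4f}, 1/lambda={1 / lam:.4f}")
```

On this toy problem the coordinates whose targets lie outside the box [-1/λ, 1/λ] should settle near its boundary, while the coordinate with target 0.1 stays in the interior. That is the qualitative behavior the Research Type row describes, though a toy quadratic says nothing about the paper's language-modeling results.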