Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

Authors: Shuo Xie, Zhiyuan Li

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we run experiments to verify the theoretical claims. In Section 5.1, we show that the ℓ∞ norm of iterates by AdamW can converge below 1/λ as shown in Theorem 1.1 even when the function is non-convex. In Section 5.2, we show that steepest descent w.r.t. ℓ∞ norm works better than w.r.t. ℓ2 norm for a specific function, which has better properties under ℓ∞ geometry." (The 1/λ bound is probed numerically in Sketch 1 below.)
Researcher Affiliation | Academia | "Toyota Technological Institute at Chicago, IL, the United States. Correspondence to: Shuo Xie <shuox@ttic.edu>, Zhiyuan Li <zhiyuanli@ttic.edu>."
Pseudocode | Yes | "Algorithm 1: Adam with ℓ2 regularization and Adam with decoupled weight decay (AdamW)" (The two update rules are contrasted in Sketch 2 below.)
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for its methodology is publicly available.
Open Datasets | Yes | "We train a small two-layer transformer for language modeling task on the Penn Treebank dataset (PTB) (Marcus et al., 1993)"
Dataset Splits | No | The paper mentions training on the Penn Treebank dataset but does not explicitly detail the train/validation/test splits, their sizes, or how they were derived. (PTB's standard splits are shown for reference in Sketch 3 below.)
Hardware Specification | Yes | "The experiments are run on a single A4000 or a single A6000."
Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify its version number or any other software dependencies with version details.
Experiment Setup | Yes | "We train the model in full batch without dropout in order to get deterministic gradients and follow the constant learning rate setting for the total 12800 epochs. The learning rate η is 10^-3. For each setting of β1, β2, we use Adam and AdamW with weight decay coefficient λ = 1, 2 to compare the ℓ∞ norm for iterates in each optimizer. We employ the standard implementation in PyTorch but set ϵ to be 10^-16 in Adam and AdamW rather than 0..." "Each run is repeated for 4 random seeds..." (This setup is assembled into a runnable harness in Sketch 4 below.)
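
Sketch 1: the Section 5.1 claim, that the ℓ∞ norm of AdamW's iterates converges below 1/λ, is easy to probe numerically. The sketch below is a minimal illustration, assuming a toy non-convex objective rather than the paper's transformer; the objective, dimension, and step count are our assumptions.

```python
# Sketch 1: track ||x||_inf under AdamW against the 1/lambda bound of Theorem 1.1.
# Toy non-convex objective; NOT the paper's language-modeling setup.
import torch

torch.manual_seed(0)
x = torch.randn(100, requires_grad=True)   # the iterate being optimized
lam = 1.0                                  # weight decay coefficient lambda
opt = torch.optim.AdamW([x], lr=1e-3, weight_decay=lam)

for step in range(1, 20001):
    opt.zero_grad()
    loss = (x ** 3).sin().sum()            # stand-in non-convex objective
    loss.backward()
    opt.step()
    if step % 5000 == 0:
        print(f"step {step}: ||x||_inf = {x.abs().max().item():.4f}  (1/lambda = {1 / lam:.4f})")
```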
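Sketch 2 restates the contrast that Algorithm 1 draws between Adam with ℓ2 regularization and AdamW with decoupled weight decay. These are the standard recursions; the single-tensor framing and variable names are ours, not the paper's pseudocode verbatim.

```python
# Sketch 2: one step of Adam with l2 regularization vs. AdamW (decoupled decay).
import torch

def adam_step(x, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, lam=1.0,
              eps=1e-16, decoupled=False):
    """Return updated (x, m, v) after one Adam/AdamW step at time t >= 1."""
    if not decoupled:
        g = g + lam * x                    # l2 regularization: decay enters the moments
    m = b1 * m + (1 - b1) * g              # first-moment EMA
    v = b2 * v + (1 - b2) * g ** 2         # second-moment EMA
    m_hat = m / (1 - b1 ** t)              # bias corrections
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (v_hat.sqrt() + eps)
    if decoupled:
        x = x - lr * lam * x               # AdamW: decay bypasses the preconditioner
    return x, m, v

x, m, v = torch.ones(3), torch.zeros(3), torch.zeros(3)
x, m, v = adam_step(x, torch.full((3,), 0.1), m, v, t=1, decoupled=True)
```

The key difference is where λ acts: under ℓ2 regularization the decay term is rescaled by the adaptive preconditioner, while AdamW shrinks the weights directly, the mechanism behind the ℓ∞-norm constraint in the paper's title.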
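Sketch 3: since the paper does not state which PTB partition it used, the dataset's standard train/valid/test split is shown for reference via torchtext. This is a tooling assumption on our part, not the authors' pipeline.

```python
# Sketch 3: standard Penn Treebank splits via torchtext (assumed tooling,
# not the authors' data pipeline).
from torchtext.datasets import PennTreebank

train_iter, valid_iter, test_iter = PennTreebank(split=("train", "valid", "test"))
print(next(iter(train_iter)))   # one raw line of training text
```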
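Sketch 4 assembles the quoted setup into a runnable harness. The model and data are stand-ins (the paper trains a small two-layer transformer on PTB in full batch); only the optimizer settings (constant learning rate 10^-3, ϵ = 10^-16, λ ∈ {1, 2}, four seeds, 12800 epochs) come from the quote.

```python
# Sketch 4: the reported optimizer configuration around a stand-in model.
import torch
import torch.nn.functional as F

def run(seed, lam, betas=(0.9, 0.999), decoupled=True, epochs=12800):
    torch.manual_seed(seed)
    # Placeholder model/data; the paper uses a two-layer transformer on PTB.
    model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 10))
    inputs, targets = torch.randn(256, 32), torch.randint(0, 10, (256,))
    opt_cls = torch.optim.AdamW if decoupled else torch.optim.Adam
    opt = opt_cls(model.parameters(), lr=1e-3, betas=betas,
                  eps=1e-16, weight_decay=lam)   # epsilon = 1e-16 as reported
    for _ in range(epochs):                      # full batch, constant LR, no dropout
        opt.zero_grad()
        F.cross_entropy(model(inputs), targets).backward()
        opt.step()
    return max(p.abs().max().item() for p in model.parameters())  # l_inf norm of iterate

# lambda in {1, 2}, four random seeds per setting, as in the quoted setup.
results = {(s, l): run(s, l) for s in range(4) for l in (1.0, 2.0)}
```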