Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

Authors: Shuo Xie, Zhiyuan Li

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we run experiments to verify the theoretical claims. In Section 5.1, we show that the ℓ∞ norm of iterates by AdamW can converge below 1/λ as shown in Theorem 1.1 even when the function is non-convex. In Section 5.2, we show that steepest descent w.r.t. ℓ∞ norm works better than w.r.t. ℓ2 norm for a specific function, which has better properties under ℓ∞ geometry." (The 1/λ bound is probed numerically in Sketch 1 below.)
Researcher Affiliation | Academia | "Toyota Technological Institute at Chicago, IL, the United States. Correspondence to: Shuo Xie <shuox@ttic.edu>, Zhiyuan Li <zhiyuanli@ttic.edu>."
Pseudocode | Yes | "Algorithm 1: Adam with ℓ2 regularization and Adam with decoupled weight decay (AdamW)" (The two update rules are contrasted in Sketch 2 below.)
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for its methodology is publicly available.
Open Datasets | Yes | "We train a small two-layer transformer for language modeling task on the Penn Treebank dataset (PTB) (Marcus et al., 1993)"
Dataset Splits | No | The paper mentions training on the Penn Treebank dataset but does not explicitly detail the train/validation/test splits, their sizes, or how they were derived. (PTB's standard splits are shown for reference in Sketch 3 below.)
Hardware Specification | Yes | "The experiments are run on a single A4000 or a single A6000."
Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify its version number or any other software dependencies with version details.
Experiment Setup | Yes | "We train the model in full batch without dropout in order to get deterministic gradients and follow the constant learning rate setting for the total 12800 epochs. The learning rate η is 10^-3. For each setting of β1, β2, we use Adam and AdamW with weight decay coefficient λ = 1, 2 to compare the ℓ∞ norm for iterates in each optimizer. We employ the standard implementation in PyTorch but set ϵ to be 10^-16 in Adam and AdamW rather than 0..." "Each run is repeated for 4 random seeds..." (This setup is assembled into a runnable harness in Sketch 4 below.)
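
Sketch 1: the Section 5.1 claim, that the ℓ∞ norm of AdamW's iterates converges below 1/λ, is easy to probe numerically. The sketch below is a minimal illustration, assuming a toy non-convex objective rather than the paper's transformer; the objective, dimension, and step count are our assumptions.

```python
# Sketch 1: track ||x||_inf under AdamW against the 1/lambda bound of Theorem 1.1.
# Toy non-convex objective; NOT the paper's language-modeling setup.
import torch

torch.manual_seed(0)
x = torch.randn(100, requires_grad=True)   # the iterate being optimized
lam = 1.0                                  # weight decay coefficient lambda
opt = torch.optim.AdamW([x], lr=1e-3, weight_decay=lam)

for step in range(1, 20001):
    opt.zero_grad()
    loss = (x ** 3).sin().sum()            # stand-in non-convex objective
    loss.backward()
    opt.step()
    if step % 5000 == 0:
        print(f"step {step}: ||x||_inf = {x.abs().max().item():.4f}  (1/lambda = {1 / lam:.4f})")
```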
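Sketch 2 restates the contrast that Algorithm 1 draws between Adam with ℓ2 regularization and AdamW with decoupled weight decay. These are the standard recursions; the single-tensor framing and variable names are ours, not the paper's pseudocode verbatim.

```python
# Sketch 2: one step of Adam with l2 regularization vs. AdamW (decoupled decay).
import torch

def adam_step(x, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, lam=1.0,
              eps=1e-16, decoupled=False):
    """Return updated (x, m, v) after one Adam/AdamW step at time t >= 1."""
    if not decoupled:
        g = g + lam * x                    # l2 regularization: decay enters the moments
    m = b1 * m + (1 - b1) * g              # first-moment EMA
    v = b2 * v + (1 - b2) * g ** 2         # second-moment EMA
    m_hat = m / (1 - b1 ** t)              # bias corrections
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (v_hat.sqrt() + eps)
    if decoupled:
        x = x - lr * lam * x               # AdamW: decay bypasses the preconditioner
    return x, m, v

x, m, v = torch.ones(3), torch.zeros(3), torch.zeros(3)
x, m, v = adam_step(x, torch.full((3,), 0.1), m, v, t=1, decoupled=True)
```

The key difference is where λ acts: under ℓ2 regularization the decay term is rescaled by the adaptive preconditioner, while AdamW shrinks the weights directly, the mechanism behind the ℓ∞-norm constraint in the paper's title.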
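Sketch 3: since the paper does not state which PTB partition it used, the dataset's standard train/valid/test split is shown for reference via torchtext. This is a tooling assumption on our part, not the authors' pipeline.

```python
# Sketch 3: standard Penn Treebank splits via torchtext (assumed tooling,
# not the authors' data pipeline).
from torchtext.datasets import PennTreebank

train_iter, valid_iter, test_iter = PennTreebank(split=("train", "valid", "test"))
print(next(iter(train_iter)))   # one raw line of training text
```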
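Sketch 4 assembles the quoted setup into a runnable harness. The model and data are stand-ins (the paper trains a small two-layer transformer on PTB in full batch); only the optimizer settings (constant learning rate 10^-3, ϵ = 10^-16, λ ∈ {1, 2}, four seeds, 12800 epochs) come from the quote.

```python
# Sketch 4: the reported optimizer configuration around a stand-in model.
import torch
import torch.nn.functional as F

def run(seed, lam, betas=(0.9, 0.999), decoupled=True, epochs=12800):
    torch.manual_seed(seed)
    # Placeholder model/data; the paper uses a two-layer transformer on PTB.
    model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 10))
    inputs, targets = torch.randn(256, 32), torch.randint(0, 10, (256,))
    opt_cls = torch.optim.AdamW if decoupled else torch.optim.Adam
    opt = opt_cls(model.parameters(), lr=1e-3, betas=betas,
                  eps=1e-16, weight_decay=lam)   # epsilon = 1e-16 as reported
    for _ in range(epochs):                      # full batch, constant LR, no dropout
        opt.zero_grad()
        F.cross_entropy(model(inputs), targets).backward()
        opt.step()
    return max(p.abs().max().item() for p in model.parameters())  # l_inf norm of iterate

# lambda in {1, 2}, four random seeds per setting, as in the quoted setup.
results = {(s, l): run(s, l) for s in range(4) for l in (1.0, 2.0)}
```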