Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Adam-family Methods with Decoupled Weight Decay in Deep Learning

Authors: Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Numerical experiments demonstrate that AdamD outperforms Adam and is comparable to AdamW, in the aspects of both generalization performance and efficiency. [...] In this section, we conduct numerical experiments to demonstrate the effectiveness of AdamD in the context of image classification and language modeling tasks.
Researcher Affiliation Academia Kuangyu Ding (EMAIL), Edwardson School of Industrial Engineering, Purdue University; Nachuan Xiao (EMAIL), School of Data Science, The Chinese University of Hong Kong, Shenzhen; Kim-Chuan Toh (EMAIL), Department of Mathematics and Institute of Operations Research and Analytics, National University of Singapore
Pseudocode Yes Algorithm 1 Adam with decoupled weight decay (AdamD) for nonsmooth problem (UOP). [...] Algorithm 2 AdamW (Loshchilov & Hutter, 2019).
Open Source Code No The paper does not explicitly state that source code for their methodology is released, nor does it provide a direct link to a code repository. It only mentions the implementation environment: "All experiments are conducted using an NVIDIA RTX 3090 Ti GPU and are implemented in Python 3.9 with PyTorch 1.12.0."
Open Datasets Yes Our image classification experiments include the deployment of well-established architectures, namely ResNet34 (He et al., 2016) and DenseNet121 (Huang et al., 2018), to train the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). Our language modeling experiments focus on LSTM networks applied to the Penn Treebank dataset (Marcus et al., 1993).
Dataset Splits Yes In all our experiments on image classification, we train the models consistently for 200 epochs, employing a batch size of 128. At the 150th epoch, we reduce the step size by a factor of 0.1. [...] In all our language modeling experiments, we train our models for 200 epochs using a batch size of 128. We employ a step size reduction strategy that decreases the learning rate to 0.1 times its previous value twice during training, specifically at the 75th and 150th epochs.
Hardware Specification Yes All experiments are conducted using an NVIDIA RTX 3090 Ti GPU and are implemented in Python 3.9 with PyTorch 1.12.0.
Software Dependencies Yes All experiments are conducted using an NVIDIA RTX 3090 Ti GPU and are implemented in Python 3.9 with PyTorch 1.12.0.
Experiment Setup Yes In all our experiments on image classification, we train the models consistently for 200 epochs, employing a batch size of 128. At the 150th epoch, we reduce the step size by a factor of 0.1. [...] For the weight decay parameter, we consider values σ ∈ {5×10⁻³, 10⁻³, 5×10⁻⁎, 10⁻⁎}. By fixing σ first, we ensure that all methods solve the same minimization problem. With σ fixed, we then perform a grid search over the learning rate η for AdamD, Adam, and AdamW using η ∈ {5×10⁻⁔, 10⁻⁎, 5×10⁻⁎, 10⁻³, 5×10⁻³, 10⁻ÂČ, 5×10⁻ÂČ, 10⁻¹}. Other parameters are set as follows: Adam/AdamW: We set Δ = 10⁻⁞, Ξ_k = 10⁻¹ and ÎČ = 10⁻³ as the default setting in PyTorch. AdamD: We set Ξ_s = Ξ_0 (log(s+2))^(−3/2), with s representing the epoch number. [...] Here, we set the initial momentum parameter to Ξ_0 = 10⁻¹, the second moment parameter to ÎČ = 10⁻³ and the regularization parameter to Δ = 10⁻⁞, which are the same as the default settings in PyTorch for Adam/AdamW.
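The decoupled weight decay referenced in the Pseudocode row (Algorithm 2, AdamW of Loshchilov & Hutter, 2019) can be contrasted with plain Adam-plus-L2 in a few lines. The sketch below is a minimal scalar illustration with helper names of my own choosing, not the paper's Algorithm 1 for the nonsmooth setting:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=1e-2):
    """One scalar AdamW step: the weight-decay term shrinks theta
    directly (decoupled), instead of being added to the gradient."""
    m = b1 * m + (1 - b1) * grad           # first moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta *= 1 - lr * wd                   # decoupled decay
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def adam_l2_step(theta, grad, m, v, t, **kw):
    """Plain Adam with L2 regularization, for contrast: the decay term
    is folded into the gradient and so gets rescaled by the adaptive
    denominator, which is exactly what decoupling avoids."""
    wd = kw.pop("wd", 1e-2)
    return adamw_step(theta, grad + wd * theta, m, v, t, wd=0.0, **kw)
```

Running both steps from the same state produces slightly different iterates, which is the whole point of the distinction the paper studies.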
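The step-size schedule quoted in the Dataset Splits row (multiply by 0.1 at epoch 150 for image classification, and at epochs 75 and 150 for language modeling) is a standard step decay. A minimal sketch, with the function name being mine:

```python
def lr_at_epoch(epoch, base_lr, milestones, gamma=0.1):
    """Step-decay learning rate: multiply base_lr by gamma once for
    each milestone epoch that has already been reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[150]` (or `[75, 150]`) and `gamma=0.1`.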
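The tuning protocol quoted in the Experiment Setup row (fix the weight decay σ first so all methods solve the same problem, then sweep the learning rate η) and AdamD's diminishing momentum parameter Ξ_s = Ξ_0 (log(s+2))^(−3/2) can be sketched as follows; the grids are taken from the quote, the function names are mine:

```python
import itertools
import math

WD_GRID = [5e-3, 1e-3, 5e-4, 1e-4]                           # sigma values
LR_GRID = [5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1]   # eta values

def hyperparameter_grid():
    """Enumerate (sigma, eta) pairs: sigma is fixed first so that every
    method minimizes the same objective, then eta is grid-searched."""
    yield from itertools.product(WD_GRID, LR_GRID)

def theta_s(s, theta0=0.1):
    """AdamD's momentum parameter at epoch s:
    theta_s = theta0 * (log(s + 2))^(-3/2), diminishing over epochs."""
    return theta0 * math.log(s + 2) ** -1.5
```

The 4×8 grid yields 32 configurations per method, and the schedule decays slowly enough that the momentum parameter stays useful throughout the 200 training epochs.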