Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Adam-family Methods with Decoupled Weight Decay in Deep Learning
Authors: Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments demonstrate that AdamD outperforms Adam and is comparable to AdamW, in the aspects of both generalization performance and efficiency. [...] In this section, we conduct numerical experiments to demonstrate the effectiveness of AdamD in the context of image classification and language modeling tasks. |
| Researcher Affiliation | Academia | Kuangyu Ding (EMAIL), Edwardson School of Industrial Engineering, Purdue University; Nachuan Xiao (EMAIL), School of Data Science, The Chinese University of Hong Kong, Shenzhen; Kim-Chuan Toh (EMAIL), Department of Mathematics and Institute of Operations Research and Analytics, National University of Singapore |
| Pseudocode | Yes | Algorithm 1 Adam with decoupled weight decay (AdamD) for nonsmooth problem (UOP). [...] Algorithm 2 AdamW (Loshchilov & Hutter, 2019). |
| Open Source Code | No | The paper does not explicitly state that source code for their methodology is released, nor does it provide a direct link to a code repository. It only mentions the implementation environment: "All experiments are conducted using an NVIDIA RTX 3090 Ti GPU and are implemented in Python 3.9 with PyTorch 1.12.0." |
| Open Datasets | Yes | Our image classification experiments include the deployment of well-established architectures, namely Resnet34 (He et al., 2016) and Densenet121 (Huang et al., 2018), to train the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). Our language modeling experiments focus on LSTM networks applied to the Penn Treebank dataset (Marcus et al., 1993). |
| Dataset Splits | Yes | In all our experiments on image classification, we train the models consistently for 200 epochs, employing a batch size of 128. At the 150th epoch, we reduce the step size by a factor of 0.1. [...] In all our language modeling experiments, we train our models for 200 epochs using a batch size of 128. We employ a step size reduction strategy that decreases the learning rate to 0.1 times its previous value twice during training, specifically at the 75th and 150th epochs. |
| Hardware Specification | Yes | All experiments are conducted using an NVIDIA RTX 3090 Ti GPU and are implemented in Python 3.9 with PyTorch 1.12.0. |
| Software Dependencies | Yes | All experiments are conducted using an NVIDIA RTX 3090 Ti GPU and are implemented in Python 3.9 with PyTorch 1.12.0. |
| Experiment Setup | Yes | In all our experiments on image classification, we train the models consistently for 200 epochs, employing a batch size of 128. At the 150th epoch, we reduce the step size by a factor of 0.1. [...] For the weight decay parameter, we consider values in τ ∈ {5×10⁻³, 10⁻³, 5×10⁻⁴, 10⁻⁴}. By fixing τ first, we ensure that all methods solve the same minimization problem. With τ fixed, we then perform a grid search over the learning rate η for AdamD, Adam, and AdamW using η ∈ {5×10⁻⁵, 10⁻⁴, 5×10⁻⁴, 10⁻³, 5×10⁻³, 10⁻², 5×10⁻², 10⁻¹}. Other parameters are set as follows: Adam/AdamW: We set ε = 10⁻⁸, θ_k = 10⁻¹ and β = 10⁻³ as the default setting in PyTorch. AdamD: We set θ_s = θ₀(log(s+2))⁻³ᐟ², with s representing the epoch number. [...] Here, we set the initial momentum parameter to θ₀ = 10⁻¹, the second moment parameter to β = 10⁻³ and the regularization parameter to ε = 10⁻⁸, which are the same as the default settings in PyTorch for Adam/AdamW. |
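The quoted setup pins down two ingredients that are easy to sketch: an update in which the weight-decay term τw is applied to the weights directly (decoupled, AdamW-style) rather than folded into the gradient, and AdamD's epoch-indexed momentum schedule θ_s = θ₀(log(s+2))⁻³ᐟ². The sketch below is illustrative only: it assumes θ and β weight the incoming gradient, so that θ₀ = 10⁻¹ and β = 10⁻³ correspond to PyTorch's default betas = (0.9, 0.999); the paper's Algorithm 1 addresses the general nonsmooth problem (UOP) and may differ in detail, and the function names here are mine, not the paper's.

```python
import math

def theta_schedule(theta0, s):
    """Momentum schedule quoted in the Experiment Setup row:
    theta_s = theta0 * (log(s + 2))**(-3/2), with s the epoch number."""
    return theta0 * math.log(s + 2) ** (-1.5)

def decoupled_step(w, g, m, v, lr, tau, theta=0.1, beta=1e-3, eps=1e-8):
    """One scalar parameter update with decoupled weight decay (sketch).

    theta and beta here weight the *new* gradient information, so
    theta = 0.1, beta = 1e-3 match PyTorch's default betas = (0.9, 0.999).
    The decay term tau * w is subtracted from the weights directly
    (decoupled), not added to the gradient as in Adam with L2 penalty.
    """
    m = (1 - theta) * m + theta * g        # first-moment estimate
    v = (1 - beta) * v + beta * g * g      # second-moment estimate
    w = w - lr * (m / (math.sqrt(v) + eps) + tau * w)
    return w, m, v
```

Because τ multiplies the weights outside the adaptive preconditioner, fixing τ first (as the grid search above does) keeps the regularized objective identical across optimizers while only the learning rate η varies.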