Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

Authors: Congliang Chen, Li Shen, Fangyu Zou, Wei Liu

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental At last, we apply the generic Adam and mini-batch Adam with the sufficient condition for solving the counterexample and training several neural networks on various real-world datasets. Experimental results are exactly in accord with our theoretical analysis. 5. Experimental Results: In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample (Chen et al., 2018a) and to train LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR100 dataset (Krizhevsky, 2009), respectively.
Researcher Affiliation Collaboration Congliang Chen (EMAIL), The Chinese University of Hong Kong, Shenzhen; Li Shen (EMAIL), JD Explore Academy; Fangyu Zou (EMAIL), Meta; Wei Liu (EMAIL), Tencent
Pseudocode Yes Algorithm 1 Generic Adam. Parameters: set suitable base learning rates {αt}, momentum parameters {βt}, and exponential moving average parameters {θt}, respectively. Choose x1 ∈ R^d and set initial values m0 = 0 ∈ R^d and v0 = ε1 ∈ R^d. for t = 1, 2, ..., T do: sample a stochastic gradient gt; for k = 1, 2, ..., d do: v_{t,k} = θt·v_{t−1,k} + (1 − θt)·g²_{t,k}; m_{t,k} = βt·m_{t−1,k} + (1 − βt)·g_{t,k}; x_{t+1,k} = x_{t,k} − αt·m_{t,k}/√(v_{t,k})
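The recovered update rules can be sketched in NumPy as follows. This is a hedged reconstruction from the extracted pseudocode, not the authors' implementation; `grad_fn` and the schedule callables are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def generic_adam(grad_fn, x1, alpha, beta, theta, T, eps=1e-8):
    """One possible reading of Algorithm 1 (Generic Adam).

    grad_fn(t, x) returns a stochastic gradient g_t;
    alpha, beta, theta are callables t -> scalar giving
    the schedules alpha_t, beta_t, theta_t.
    """
    x = np.asarray(x1, dtype=float).copy()
    m = np.zeros_like(x)       # m_0 = 0
    v = np.full_like(x, eps)   # v_0 = eps * 1 (keeps sqrt(v) positive)
    for t in range(1, T + 1):
        g = grad_fn(t, x)
        v = theta(t) * v + (1 - theta(t)) * g ** 2  # coordinate-wise 2nd moment
        m = beta(t) * m + (1 - beta(t)) * g         # coordinate-wise 1st moment
        x = x - alpha(t) * m / np.sqrt(v)
    return x
```

For example, with exact gradients of f(x) = ||x||²/2 (i.e. `grad_fn = lambda t, x: x`) and constant schedules, the iterates contract toward the origin.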
Open Source Code No The text does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. The license information refers to the paper itself, not its code.
Open Datasets Yes LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR100 dataset (Krizhevsky, 2009), respectively. Also, we applied mini-batch Adam to train a base model of Transformer-XL (Dai et al., 2019) on the dataset WikiText-103 (Merity et al., 2016).
Dataset Splits Yes MNIST (LeCun et al., 2010) is composed of ten classes of digits among {0, 1, 2, ..., 9}, which includes 60,000 training examples and 10,000 validation examples. CIFAR-100 (Krizhevsky, 2009) is composed of 100 classes of 32×32 color images. Each class includes 6,000 images. Besides, these images are divided into 50,000 training examples and 10,000 validation examples.
Hardware Specification No The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU or CPU models. It only mentions general experimental settings.
Software Dependencies No The paper mentions common software tools and frameworks used in deep learning, but it does not specify version numbers for any of them, such as Python, PyTorch, or TensorFlow.
Experiment Setup Yes In the experiments, for Generic Adam, we set θ_t^(r) = 1 − (0.001 + 0.999r)/t^r with r ∈ {0, 0.25, 0.5, 0.75, 1} and βt = 0.9, respectively; for RMSProp, we set βt = 0 and θt = 1 − 1/t along with the parameter settings in Mukkamala and Hein (2017). For fairness, the base learning rates αt in Generic Adam, RMSProp, and AMSGrad are all set as 0.001/√t. ... We use different batch sizes {32, 64, 128} to train networks. Besides, when training ResNet-18 on the CIFAR100 dataset, we use an ℓ2 regularization on the weights, with the coefficient of the regularization term set to 5e-4. We use grid search over [1e-2, 5e-3, 1e-3, 5e-4, 1e-4] for αt with respect to test accuracy. In addition, when training ResNet-18 on the CIFAR100 dataset, αt is reduced to 0.2αt every 19550 iterations (50 epochs at the 128 batch-size setting), a learning rate decay that is commonly used in practice.
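The schedules quoted above can be written out as below. Note the reconstructed θ_t^(r) family interpolates between a constant 0.999 (r = 0, the vanilla Adam setting) and 1 − 1/t (r = 1, the RMSProp-style setting); the exact formula and the √t in the base learning rate are reconstructed from a garbled extraction and may differ slightly from the paper.

```python
import math

def theta(t, r):
    # Reconstructed theta_t^(r) = 1 - (0.001 + 0.999*r) / t**r,
    # for r in {0, 0.25, 0.5, 0.75, 1}.
    return 1.0 - (0.001 + 0.999 * r) / t ** r

def base_lr(t, alpha0=1e-3):
    # Base learning rate alpha_t = 0.001 / sqrt(t), shared across optimizers.
    return alpha0 / math.sqrt(t)

def decayed_lr(t, alpha0=1e-3, step=19550, factor=0.2):
    # Step decay for ResNet-18 on CIFAR100: multiply alpha_t by 0.2
    # every 19550 iterations (50 epochs at batch size 128).
    return alpha0 * factor ** (t // step)
```

For instance, theta(t, 0) is always 0.999, theta(t, 1) equals 1 − 1/t, and decayed_lr drops from 1e-3 to 2e-4 at iteration 19550.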