Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

Authors: Congliang Chen, Li Shen, Fangyu Zou, Wei Liu

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental At last, we apply the generic Adam and mini-batch Adam with the sufficient condition for solving the counterexample and training several neural networks on various real-world datasets. Experimental results are exactly in accord with our theoretical analysis. 5. Experimental Results: In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample (Chen et al., 2018a) and to train LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR100 dataset (Krizhevsky, 2009), respectively.
Researcher Affiliation Collaboration Congliang Chen (EMAIL), The Chinese University of Hong Kong, Shenzhen; Li Shen (EMAIL), JD Explore Academy; Fangyu Zou (EMAIL), Meta; Wei Liu (EMAIL), Tencent
Pseudocode Yes Algorithm 1 Generic Adam. Parameters: set suitable base learning rates {αt}, momentum parameters {βt}, and exponential moving average parameters {θt}, respectively. Choose x1 ∈ R^d and set initial values m0 = 0 ∈ R^d and v0 = ε1 ∈ R^d. for t = 1, 2, ..., T do: sample a stochastic gradient gt; for k = 1, 2, ..., d do: v_{t,k} = θt·v_{t−1,k} + (1 − θt)·g²_{t,k}; m_{t,k} = βt·m_{t−1,k} + (1 − βt)·g_{t,k}; x_{t+1,k} = x_{t,k} − αt·m_{t,k}/√(v_{t,k})
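The recovered update rules can be sketched in NumPy as follows. This is a hedged reconstruction from the extracted pseudocode, not the authors' implementation; `grad_fn` and the schedule callables are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def generic_adam(grad_fn, x1, alpha, beta, theta, T, eps=1e-8):
    """One possible reading of Algorithm 1 (Generic Adam).

    grad_fn(t, x) returns a stochastic gradient g_t;
    alpha, beta, theta are callables t -> scalar giving
    the schedules alpha_t, beta_t, theta_t.
    """
    x = np.asarray(x1, dtype=float).copy()
    m = np.zeros_like(x)       # m_0 = 0
    v = np.full_like(x, eps)   # v_0 = eps * 1 (keeps sqrt(v) positive)
    for t in range(1, T + 1):
        g = grad_fn(t, x)
        v = theta(t) * v + (1 - theta(t)) * g ** 2  # coordinate-wise 2nd moment
        m = beta(t) * m + (1 - beta(t)) * g         # coordinate-wise 1st moment
        x = x - alpha(t) * m / np.sqrt(v)
    return x
```

For example, with exact gradients of f(x) = ||x||²/2 (i.e. `grad_fn = lambda t, x: x`) and constant schedules, the iterates contract toward the origin.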
Open Source Code No The text does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. The license information refers to the paper itself, not its code.
Open Datasets Yes LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR100 dataset (Krizhevsky, 2009), respectively. Also, we applied mini-batch Adam to train a base model of Transformer-XL (Dai et al., 2019) on the dataset WikiText-103 (Merity et al., 2016).
Dataset Splits Yes MNIST (LeCun et al., 2010) is composed of ten classes of digits among {0, 1, 2, ..., 9}, which includes 60,000 training examples and 10,000 validation examples. CIFAR-100 (Krizhevsky, 2009) is composed of 100 classes of 32×32 color images. Each class includes 6,000 images. Besides, these images are divided into 50,000 training examples and 10,000 validation examples.
Hardware Specification No The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU or CPU models. It only mentions general experimental settings.
Software Dependencies No The paper mentions common software tools and frameworks used in deep learning, but it does not specify version numbers for any of them, such as Python, PyTorch, or TensorFlow.
Experiment Setup Yes In the experiments, for Generic Adam, we set θ_t^(r) = 1 − (0.001 + 0.999r)/t^r with r ∈ {0, 0.25, 0.5, 0.75, 1} and βt = 0.9, respectively; for RMSProp, we set βt = 0 and θt = 1 − 1/t along with the parameter settings in Mukkamala and Hein (2017). For fairness, the base learning rates αt in Generic Adam, RMSProp, and AMSGrad are all set as 0.001/√t. ... We use different batch sizes {32, 64, 128} to train networks. Besides, when training ResNet-18 on the CIFAR100 dataset, we use an ℓ2 regularization on the weights, with the coefficient of the regularization term set to 5e-4. We use grid search over [1e-2, 5e-3, 1e-3, 5e-4, 1e-4] for αt with respect to test accuracy. In addition, when training ResNet-18 on the CIFAR100 dataset, αt is reduced to 0.2αt every 19550 iterations (50 epochs at the 128 batch-size setting), a learning rate decay that is commonly used in practice.
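The schedules quoted above can be written out as below. Note the reconstructed θ_t^(r) family interpolates between a constant 0.999 (r = 0, the vanilla Adam setting) and 1 − 1/t (r = 1, the RMSProp-style setting); the exact formula and the √t in the base learning rate are reconstructed from a garbled extraction and may differ slightly from the paper.

```python
import math

def theta(t, r):
    # Reconstructed theta_t^(r) = 1 - (0.001 + 0.999*r) / t**r,
    # for r in {0, 0.25, 0.5, 0.75, 1}.
    return 1.0 - (0.001 + 0.999 * r) / t ** r

def base_lr(t, alpha0=1e-3):
    # Base learning rate alpha_t = 0.001 / sqrt(t), shared across optimizers.
    return alpha0 / math.sqrt(t)

def decayed_lr(t, alpha0=1e-3, step=19550, factor=0.2):
    # Step decay for ResNet-18 on CIFAR100: multiply alpha_t by 0.2
    # every 19550 iterations (50 epochs at batch size 128).
    return alpha0 * factor ** (t // step)
```

For instance, theta(t, 0) is always 0.999, theta(t, 1) equals 1 − 1/t, and decayed_lr drops from 1e-3 to 2e-4 at iteration 19550.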