Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration
Authors: Congliang Chen, Li Shen, Fangyu Zou, Wei Liu
JMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | At last, we apply the generic Adam and mini-batch Adam with the sufficient condition for solving the counterexample and training several neural networks on various real-world datasets. Experimental results are exactly in accord with our theoretical analysis. 5. Experimental Results In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample (Chen et al., 2018a) and to train LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR100 dataset (Krizhevsky, 2009), respectively. |
| Researcher Affiliation | Collaboration | Congliang Chen EMAIL The Chinese University of Hong Kong, Shenzhen Li Shen EMAIL JD Explore Academy Fangyu Zou EMAIL Meta Wei Liu EMAIL Tencent |
| Pseudocode | Yes | Algorithm 1 Generic Adam. Parameters: set suitable base learning rates {αt}, momentum parameters {βt}, and exponential moving average parameters {θt}, respectively. Choose x1 ∈ R^d and set initial values m0 = 0 ∈ R^d and v0 = ϵ ∈ R^d. for t = 1, 2, ..., T do: sample a stochastic gradient gt; for k = 1, 2, ..., d do: v_{t,k} = θt·v_{t−1,k} + (1 − θt)·g²_{t,k}; m_{t,k} = βt·m_{t−1,k} + (1 − βt)·g_{t,k}; x_{t+1,k} = x_{t,k} − αt·m_{t,k}/√v_{t,k}; |
| Open Source Code | No | The text does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. The license information refers to the paper itself, not its code. |
| Open Datasets | Yes | LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR100 dataset (Krizhevsky, 2009), respectively. Also, we applied mini-batch Adam to train a base model of Transformer-XL (Dai et al., 2019) on the dataset WikiText-103 (Merity et al., 2016). |
| Dataset Splits | Yes | MNIST (LeCun et al., 2010) is composed of ten classes of digits among {0, 1, 2, ..., 9}, which includes 60,000 training examples and 10,000 validation examples. CIFAR-100 (Krizhevsky, 2009) is composed of 100 classes of 32x32 color images. Each class includes 6,000 images. Besides, these images are divided into 50,000 training examples and 10,000 validation examples. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU or CPU models. It only mentions general experimental settings. |
| Software Dependencies | No | The paper mentions common software tools and frameworks used in deep learning, but it does not specify version numbers for any of them, such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | In the experiments, for Generic Adam, we set θ_t^(r) = 1 − (0.001 + 0.999r)/t^r with r ∈ {0, 0.25, 0.5, 0.75, 1} and βt = 0.9, respectively; for RMSProp, we set βt = 0 and θt = 1 − 1/t along with the parameter settings in Mukkamala and Hein (2017). For fairness, the base learning rates αt in Generic Adam, RMSProp, and AMSGrad are all set as 0.001/t. ... We use different batch sizes {32, 64, 128} to train networks. Besides, when training ResNet-18 on the CIFAR100 dataset, we use an ℓ2 regularization on the weights, setting the coefficient of the regularization term to 5e-4. We use grid search in [1e-2, 5e-3, 1e-3, 5e-4, 1e-4] for αt with respect to test accuracy. In addition, when training ResNet-18 on the CIFAR100 dataset, αt will reduce to 0.2·αt every 19550 iterations (50 epochs for the 128 batch-size setting), which (learning rate decay) is commonly used in practice. |
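The Generic Adam update quoted in the Pseudocode row can be sketched as a per-coordinate NumPy step. This is our illustrative reading of Algorithm 1, not the authors' code; the function and argument names are ours, and ϵ is folded into the initial `v`:

```python
import numpy as np

def generic_adam_step(x, m, v, g, alpha_t, beta_t, theta_t):
    """One step of Generic Adam (sketch of Algorithm 1 in the paper).

    Per coordinate k:
        v_{t,k} = theta_t * v_{t-1,k} + (1 - theta_t) * g_{t,k}^2
        m_{t,k} = beta_t  * m_{t-1,k} + (1 - beta_t)  * g_{t,k}
        x_{t+1,k} = x_{t,k} - alpha_t * m_{t,k} / sqrt(v_{t,k})
    """
    v = theta_t * v + (1.0 - theta_t) * g ** 2   # EMA of squared gradients
    m = beta_t * m + (1.0 - beta_t) * g          # momentum (EMA of gradients)
    x = x - alpha_t * m / np.sqrt(v)             # adaptive update
    return x, m, v
```

Setting `beta_t = 0` and `theta_t = 1 - 1/t` recovers the RMSProp variant used in the experiments, while constant `theta_t` recovers vanilla Adam.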
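The hyperparameter schedules in the Experiment Setup row can be made concrete with a short sketch. The θ-schedule formula is our reconstruction of the garbled quote (it interpolates between a constant 0.999 at r = 0, i.e. Adam-like, and 1 − 1/t at r = 1, i.e. RMSProp-like); the step decay mirrors the stated "0.2·αt every 19550 iterations". Function names are hypothetical:

```python
def theta_t(t, r):
    # theta_t^(r) = 1 - (0.001 + 0.999*r) / t**r  (reconstructed formula)
    # r = 0 -> constant 0.999; r = 1 -> 1 - 1/t, as used for RMSProp.
    return 1.0 - (0.001 + 0.999 * r) / t ** r

def step_decay_lr(iteration, base_lr, decay=0.2, every=19550):
    # Base learning rate reduced to 0.2 * alpha every 19550 iterations
    # (50 epochs at batch size 128, per the quoted setup).
    return base_lr * decay ** (iteration // every)
```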