Towards Better Generalization of Adaptive Gradient Methods

Authors: Yingxue Zhou, Belhal Karimi, Jinxing Yu, Zhiqiang Xu, Ping Li

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further conduct experiments on various popular deep learning tasks and models. Experimental results illustrate that SAGD is empirically competitive and often better than baselines. In this section, we evaluate our proposed mini-batch SAGD algorithm on various deep learning models against popular optimization methods: SGD with momentum [29], Adam [19], RMSprop [35], and AdaBound [24]. (A hedged sketch of setting up these baseline optimizers appears after the table.)
Researcher Affiliation | Industry | Yingxue Zhou, Belhal Karimi, Jinxing Yu, Zhiqiang Xu, Ping Li; Cognitive Computing Lab, Baidu Research; No.10 Xibeiwang East Road, Beijing 100193, China; 10900 NE 8th St., Bellevue, Washington 98004, USA
Pseudocode | Yes | Algorithm 1 SAGD with DGP-LAP, Algorithm 2 SAGD with DGP-SPARSE, Algorithm 3 Mini-Batch SAGD. (A hedged sketch of a generic noisy mini-batch gradient step appears after the table.)
Open Source Code | No | No concrete access to source code for the described methodology was found.
Open Datasets | Yes | We consider three tasks: the classification tasks on MNIST [22] and CIFAR-10 [20], the language modeling task on Penn Treebank [25], and the natural language inference task on the SNLI dataset [3], a corpus of 570,000 human-written English sentence pairs where the goal is to predict whether a hypothesis is an entailment, contradiction, or neutral with respect to a given text.
Dataset Splits | Yes | The Penn Treebank dataset contains 929,589, 73,760, and 82,430 tokens for training, validation, and test, respectively.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were mentioned.
Software Dependencies | No | The paper mentions various optimization methods (e.g., SGD, Adam, RMSprop) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch versions).
Experiment Setup | Yes | The mini-batch size is set to 128 for CIFAR-10 and MNIST, and 20 for Penn Treebank and SNLI. We run 100 epochs and decay the learning rate by 0.5 every 30 epochs. We use σ = 0.8 for ReLU and σ = 1.0 for Sigmoid. We run 200 epochs and decay the learning rate by 0.1 every 30 epochs. We use σ = 0.01 for both ResNet-18 and VGG-19. We train them for a fixed budget of 500 epochs and omit the learning-rate decay. We use σ = 0.01 for both models. We use 300 dimensions as fixed word embeddings and set the learning rate following the method described above. We set the noise parameter σ = 0.01. (A hedged sketch of the quoted learning-rate schedules appears after the table.)
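
The baselines quoted in the Research Type row map onto standard PyTorch optimizers. The sketch below is illustrative only: the learning rates and momentum value are placeholders rather than the paper's settings, and AdaBound is assumed to come from the third-party adabound package.

```python
import torch
import torch.nn as nn

# Toy model standing in for the paper's ResNet-18 / VGG-19 / LSTM models.
model = nn.Linear(10, 2)

# Baseline optimizers named in the paper; the hyperparameter values here are
# placeholders, not the settings reported by the authors.
baselines = {
    "sgd_momentum": torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9),
    "adam": torch.optim.Adam(model.parameters(), lr=1e-3),
    "rmsprop": torch.optim.RMSprop(model.parameters(), lr=1e-3),
}

# AdaBound is not part of torch.optim; it is assumed to be installed separately:
#   pip install adabound
# import adabound
# baselines["adabound"] = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
```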
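The Pseudocode row names SAGD variants with DGP-LAP and DGP-SPARSE perturbations, but the algorithm bodies are not reproduced in this excerpt. The following is only a generic reconstruction of a noisy mini-batch step, assuming DGP-LAP adds Laplace noise of scale σ to the mini-batch gradient; it should not be read as the authors' Algorithm 3.

```python
import torch

def noisy_minibatch_step(model, loss_fn, batch, lr=0.1, sigma=0.01):
    """One mini-batch step with Laplace gradient perturbation.

    A generic reconstruction based only on the algorithm names (treating
    DGP-LAP as Laplace-mechanism gradient perturbation is an assumption)
    and the noise scale sigma quoted in the experiment setup; it is not
    the authors' Algorithm 3.
    """
    inputs, targets = batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    laplace = torch.distributions.Laplace(0.0, sigma)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # Perturb the mini-batch gradient, then take a plain descent step.
            noisy_grad = p.grad + laplace.sample(p.grad.shape)
            p -= lr * noisy_grad
    return loss.item()
```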
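The step-decay schedules quoted in the Experiment Setup row correspond to standard StepLR schedules. The sketch below encodes the quoted values; pairing the 100-epoch, 0.5-decay schedule with the MNIST models and the 200-epoch, 0.1-decay schedule with ResNet-18/VGG-19 on CIFAR-10 follows the ordering of the quote and is an inference, not an explicit statement in the excerpt.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 2)                                  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # lr is a placeholder

# Quoted schedule: decay the learning rate by 0.5 every 30 epochs over 100 epochs.
# The CIFAR-10 runs instead use gamma=0.1 every 30 epochs over 200 epochs, and the
# Penn Treebank / SNLI runs train for 500 epochs with no decay (no scheduler).
scheduler = StepLR(optimizer, step_size=30, gamma=0.5)

batch_size = 128  # quoted: 128 for CIFAR-10 and MNIST, 20 for Penn Treebank and SNLI

for epoch in range(100):
    # ... one epoch of mini-batch training goes here ...
    scheduler.step()
```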