Towards Better Generalization of Adaptive Gradient Methods

Authors: Yingxue Zhou, Belhal Karimi, Jinxing Yu, Zhiqiang Xu, Ping Li

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further conduct experiments on various popular deep learning tasks and models. Experimental results illustrate that SAGD is empirically competitive and often better than baselines. In this section, we evaluate our proposed mini-batch SAGD algorithm on various deep learning models against popular optimization methods: SGD with momentum [29], Adam [19], RMSprop [35], and AdaBound [24]. (A hedged sketch of setting up these baseline optimizers appears after the table.)
Researcher Affiliation | Industry | Yingxue Zhou, Belhal Karimi, Jinxing Yu, Zhiqiang Xu, Ping Li; Cognitive Computing Lab, Baidu Research; No.10 Xibeiwang East Road, Beijing 100193, China; 10900 NE 8th St., Bellevue, Washington 98004, USA
Pseudocode | Yes | Algorithm 1 SAGD with DGP-LAP, Algorithm 2 SAGD with DGP-SPARSE, Algorithm 3 Mini-Batch SAGD. (A hedged sketch of a generic noisy mini-batch gradient step appears after the table.)
Open Source Code | No | No concrete access to source code for the described methodology was found.
Open Datasets | Yes | We consider three tasks: the classification tasks on MNIST [22] and CIFAR-10 [20], the language modeling task on Penn Treebank [25], and the natural language inference task on the SNLI dataset [3], a corpus of 570,000 human-written English sentence pairs where the goal is to predict whether a hypothesis is an entailment, contradiction, or neutral with respect to a given text.
Dataset Splits | Yes | The Penn Treebank dataset contains 929,589, 73,760, and 82,430 tokens for training, validation, and test, respectively.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were mentioned.
Software Dependencies | No | The paper mentions various optimization methods (e.g., SGD, Adam, RMSprop) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch versions).
Experiment Setup | Yes | The mini-batch size is set to 128 for CIFAR-10 and MNIST, and 20 for Penn Treebank and SNLI. We run 100 epochs and decay the learning rate by 0.5 every 30 epochs. We use σ = 0.8 for ReLU and σ = 1.0 for Sigmoid. We run 200 epochs and decay the learning rate by 0.1 every 30 epochs. We use σ = 0.01 for both ResNet-18 and VGG-19. We train them for a fixed budget of 500 epochs and omit the learning-rate decay. We use σ = 0.01 for both models. We use 300 dimensions as fixed word embeddings and set the learning rate following the method described above. We set the noise parameter σ = 0.01. (A hedged sketch of the quoted learning-rate schedules appears after the table.)
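
The baselines quoted in the Research Type row map onto standard PyTorch optimizers. The sketch below is illustrative only: the learning rates and momentum value are placeholders rather than the paper's settings, and AdaBound is assumed to come from the third-party adabound package.

```python
import torch
import torch.nn as nn

# Toy model standing in for the paper's ResNet-18 / VGG-19 / LSTM models.
model = nn.Linear(10, 2)

# Baseline optimizers named in the paper; the hyperparameter values here are
# placeholders, not the settings reported by the authors.
baselines = {
    "sgd_momentum": torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9),
    "adam": torch.optim.Adam(model.parameters(), lr=1e-3),
    "rmsprop": torch.optim.RMSprop(model.parameters(), lr=1e-3),
}

# AdaBound is not part of torch.optim; it is assumed to be installed separately:
#   pip install adabound
# import adabound
# baselines["adabound"] = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
```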
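The Pseudocode row names SAGD variants with DGP-LAP and DGP-SPARSE perturbations, but the algorithm bodies are not reproduced in this excerpt. The following is only a generic reconstruction of a noisy mini-batch step, assuming DGP-LAP adds Laplace noise of scale σ to the mini-batch gradient; it should not be read as the authors' Algorithm 3.

```python
import torch

def noisy_minibatch_step(model, loss_fn, batch, lr=0.1, sigma=0.01):
    """One mini-batch step with Laplace gradient perturbation.

    A generic reconstruction based only on the algorithm names (treating
    DGP-LAP as Laplace-mechanism gradient perturbation is an assumption)
    and the noise scale sigma quoted in the experiment setup; it is not
    the authors' Algorithm 3.
    """
    inputs, targets = batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    laplace = torch.distributions.Laplace(0.0, sigma)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # Perturb the mini-batch gradient, then take a plain descent step.
            noisy_grad = p.grad + laplace.sample(p.grad.shape)
            p -= lr * noisy_grad
    return loss.item()
```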
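The step-decay schedules quoted in the Experiment Setup row correspond to standard StepLR schedules. The sketch below encodes the quoted values; pairing the 100-epoch, 0.5-decay schedule with the MNIST models and the 200-epoch, 0.1-decay schedule with ResNet-18/VGG-19 on CIFAR-10 follows the ordering of the quote and is an inference, not an explicit statement in the excerpt.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 2)                                  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # lr is a placeholder

# Quoted schedule: decay the learning rate by 0.5 every 30 epochs over 100 epochs.
# The CIFAR-10 runs instead use gamma=0.1 every 30 epochs over 200 epochs, and the
# Penn Treebank / SNLI runs train for 500 epochs with no decay (no scheduler).
scheduler = StepLR(optimizer, step_size=30, gamma=0.5)

batch_size = 128  # quoted: 128 for CIFAR-10 and MNIST, 20 for Penn Treebank and SNLI

for epoch in range(100):
    # ... one epoch of mini-batch training goes here ...
    scheduler.step()
```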