The Marginal Value of Adaptive Gradient Methods in Machine Learning

Authors: Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, Benjamin Recht

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We additionally present numerical experiments demonstrating that adaptive methods generalize worse than their non-adaptive counterparts. Our experiments reveal three primary findings. First, with the same amount of hyperparameter tuning, SGD and SGD with momentum outperform adaptive methods on the development/test set across all evaluated models and tasks. This is true even when the adaptive methods achieve the same training loss or lower than non-adaptive methods. Second, adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set. (An illustrative sketch of the non-adaptive vs. adaptive update rules appears after the table.)
Researcher Affiliation | Academia | University of California, Berkeley; Toyota Technological Institute at Chicago
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical formulations of the algorithms but no procedural pseudocode.
Open Source Code | No | The paper links to external code repositories for the network architectures used (e.g., 'cifar.torch: https://github.com/szagoruyko/cifar.torch'), but these are third-party implementations of the models. The paper does not state that its own methodology, analysis, or experimental setup code is open-source or publicly available.
Open Datasets | Yes | We study performance on four deep learning problems: (C1) the CIFAR-10 image classification task, (L1) character-level language modeling on the novel War and Peace, and (L2) discriminative parsing and (L3) generative parsing on Penn Treebank.
Dataset Splits | Yes | We allocate a pre-specified budget on the number of epochs used for training each model. When a development set was available, we chose the settings that achieved the best peak performance on the development set by the end of the fixed epoch budget. CIFAR-10 did not have an explicit development set, so we chose the settings that achieved the lowest training loss at the end of the fixed epoch budget. (A sketch of this selection rule appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments. It only mentions general frameworks such as Torch, DyNet, and TensorFlow.
Software Dependencies | No | The paper mentions software frameworks such as Torch, DyNet, and TensorFlow but does not provide specific version numbers for these or any other ancillary software components, which are necessary for a reproducible description.
Experiment Setup | Yes | To tune the step sizes, we evaluated a logarithmically-spaced grid of five step sizes... For step size decay, we explored two separate schemes, a development-based decay scheme (dev-decay) and a fixed frequency decay scheme (fixed-decay). (A sketch of this tuning protocol appears after the table.)
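The paper's comparison is between non-adaptive methods (SGD, SGD with momentum) and adaptive methods (AdaGrad, RMSProp, Adam). As a rough illustration of that distinction, and not the authors' code or models, the following minimal NumPy sketch contrasts a plain gradient-descent update with an AdaGrad-style per-coordinate update on a toy least-squares problem (full-batch gradients for simplicity; all names and constants are illustrative):

```python
# Illustrative only: contrasts a non-adaptive update (one global step size)
# with an AdaGrad-style adaptive update (per-coordinate step sizes) on a
# toy least-squares problem. Not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)

def grad(w):
    # Gradient of the least-squares objective 0.5/n * ||A w - b||^2.
    return A.T @ (A @ w - b) / len(b)

def gradient_descent(steps=500, lr=0.1):
    # Non-adaptive update: the same scalar step size for every coordinate.
    w = np.zeros(10)
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def adagrad(steps=500, lr=0.1, eps=1e-8):
    # Adaptive update: step sizes scaled per coordinate by the running
    # sum of squared gradients (AdaGrad-style).
    w = np.zeros(10)
    g2 = np.zeros(10)
    for _ in range(steps):
        g = grad(w)
        g2 += g ** 2
        w -= lr * g / (np.sqrt(g2) + eps)
    return w

for name, w in [("gradient descent", gradient_descent()), ("AdaGrad", adagrad())]:
    print(f"{name}: training loss = {0.5 * np.mean((A @ w - b) ** 2):.4f}")
```

Both updates drive the training loss down on this toy problem; the paper's point is that comparable training loss does not translate into comparable development/test performance on the deep learning tasks it studies.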
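The Experiment Setup row describes tuning over a logarithmically spaced grid of five step sizes and two decay schemes. A hedged sketch of that protocol, with grid centers, spacing, and decay constants chosen purely for illustration (the paper's exact values differ per experiment):

```python
# Hypothetical sketch of the tuning protocol: a log-spaced grid of five
# candidate step sizes, plus fixed-frequency decay and dev-based decay.
# Constants here are illustrative, not the paper's values.

def step_size_grid(center, num=5, spacing=10.0):
    # Logarithmically spaced candidates around a center value,
    # e.g. center=0.1 -> [1e-3, 1e-2, 1e-1, 1e0, 1e1].
    half = num // 2
    return [center * spacing ** k for k in range(-half, half + 1)]

def fixed_decay(lr0, epoch, every=25, factor=0.1):
    # fixed-decay: shrink the step size by a constant factor every `every` epochs.
    return lr0 * factor ** (epoch // every)

def dev_decay(lr, dev_history, factor=0.9):
    # dev-decay: shrink the step size whenever the latest dev result
    # fails to improve on the best value seen so far.
    if len(dev_history) > 1 and dev_history[-1] <= max(dev_history[:-1]):
        return lr * factor
    return lr

print(step_size_grid(0.1))                 # five log-spaced candidates
print(fixed_decay(0.1, epoch=60))          # decayed twice: 0.1 * 0.1**2
print(dev_decay(0.1, [0.70, 0.75, 0.74]))  # no new best -> step size shrinks
```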
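The Dataset Splits row describes how a setting is selected within the fixed epoch budget: best peak development-set performance when a dev set exists, otherwise (CIFAR-10) lowest final training loss. A minimal sketch of that selection rule, assuming a hypothetical `runs` dictionary of per-epoch metrics:

```python
# Hedged sketch of the selection rule: within a fixed epoch budget, pick the
# setting with the best *peak* dev performance; with no dev set, fall back to
# the lowest training loss at the end of the budget. The `runs` format is
# hypothetical.

def select_setting(runs, budget, has_dev_set=True):
    if has_dev_set:
        # Best peak dev accuracy reached within the budget.
        return max(runs, key=lambda s: max(runs[s]["dev_acc"][:budget]))
    # No dev set: lowest training loss at the end of the budget.
    return min(runs, key=lambda s: runs[s]["train_loss"][budget - 1])

runs = {
    "lr=0.1":  {"dev_acc": [0.60, 0.72, 0.75], "train_loss": [1.2, 0.8, 0.6]},
    "lr=0.01": {"dev_acc": [0.55, 0.70, 0.78], "train_loss": [1.5, 1.0, 0.7]},
}
print(select_setting(runs, budget=3))                      # 'lr=0.01' (peak dev acc)
print(select_setting(runs, budget=3, has_dev_set=False))   # 'lr=0.1' (lowest final loss)
```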