The Marginal Value of Adaptive Gradient Methods in Machine Learning

Authors: Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, Benjamin Recht

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We additionally present numerical experiments demonstrating that adaptive methods generalize worse than their non-adaptive counterparts. Our experiments reveal three primary findings. First, with the same amount of hyperparameter tuning, SGD and SGD with momentum outperform adaptive methods on the development/test set across all evaluated models and tasks. This is true even when the adaptive methods achieve the same training loss or lower than non-adaptive methods. Second, adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set. (An illustrative sketch of the non-adaptive vs. adaptive update rules appears after the table.)
Researcher Affiliation | Academia | University of California, Berkeley; Toyota Technological Institute at Chicago
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical formulations of the algorithms but no procedural pseudocode.
Open Source Code | No | The paper links to external code repositories for the network architectures used (e.g., 'cifar.torch: https://github.com/szagoruyko/cifar.torch'), but these are third-party implementations of the models. The paper does not state that its own methodology, analysis, or experimental setup code is open-source or publicly available.
Open Datasets | Yes | We study performance on four deep learning problems: (C1) the CIFAR-10 image classification task, (L1) character-level language modeling on the novel War and Peace, and (L2) discriminative parsing and (L3) generative parsing on Penn Treebank.
Dataset Splits | Yes | We allocate a pre-specified budget on the number of epochs used for training each model. When a development set was available, we chose the settings that achieved the best peak performance on the development set by the end of the fixed epoch budget. CIFAR-10 did not have an explicit development set, so we chose the settings that achieved the lowest training loss at the end of the fixed epoch budget. (A sketch of this selection rule appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments. It only mentions general frameworks such as Torch, DyNet, and TensorFlow.
Software Dependencies | No | The paper mentions software frameworks such as Torch, DyNet, and TensorFlow but does not provide specific version numbers for these or any other ancillary software components, which are necessary for a reproducible description.
Experiment Setup | Yes | To tune the step sizes, we evaluated a logarithmically-spaced grid of five step sizes... For step size decay, we explored two separate schemes, a development-based decay scheme (dev-decay) and a fixed frequency decay scheme (fixed-decay). (A sketch of this tuning protocol appears after the table.)
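The paper's comparison is between non-adaptive methods (SGD, SGD with momentum) and adaptive methods (AdaGrad, RMSProp, Adam). As a rough illustration of that distinction, and not the authors' code or models, the following minimal NumPy sketch contrasts a plain gradient-descent update with an AdaGrad-style per-coordinate update on a toy least-squares problem (full-batch gradients for simplicity; all names and constants are illustrative):

```python
# Illustrative only: contrasts a non-adaptive update (one global step size)
# with an AdaGrad-style adaptive update (per-coordinate step sizes) on a
# toy least-squares problem. Not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)

def grad(w):
    # Gradient of the least-squares objective 0.5/n * ||A w - b||^2.
    return A.T @ (A @ w - b) / len(b)

def gradient_descent(steps=500, lr=0.1):
    # Non-adaptive update: the same scalar step size for every coordinate.
    w = np.zeros(10)
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def adagrad(steps=500, lr=0.1, eps=1e-8):
    # Adaptive update: step sizes scaled per coordinate by the running
    # sum of squared gradients (AdaGrad-style).
    w = np.zeros(10)
    g2 = np.zeros(10)
    for _ in range(steps):
        g = grad(w)
        g2 += g ** 2
        w -= lr * g / (np.sqrt(g2) + eps)
    return w

for name, w in [("gradient descent", gradient_descent()), ("AdaGrad", adagrad())]:
    print(f"{name}: training loss = {0.5 * np.mean((A @ w - b) ** 2):.4f}")
```

Both updates drive the training loss down on this toy problem; the paper's point is that comparable training loss does not translate into comparable development/test performance on the deep learning tasks it studies.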
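The Experiment Setup row describes tuning over a logarithmically spaced grid of five step sizes and two decay schemes. A hedged sketch of that protocol, with grid centers, spacing, and decay constants chosen purely for illustration (the paper's exact values differ per experiment):

```python
# Hypothetical sketch of the tuning protocol: a log-spaced grid of five
# candidate step sizes, plus fixed-frequency decay and dev-based decay.
# Constants here are illustrative, not the paper's values.

def step_size_grid(center, num=5, spacing=10.0):
    # Logarithmically spaced candidates around a center value,
    # e.g. center=0.1 -> [1e-3, 1e-2, 1e-1, 1e0, 1e1].
    half = num // 2
    return [center * spacing ** k for k in range(-half, half + 1)]

def fixed_decay(lr0, epoch, every=25, factor=0.1):
    # fixed-decay: shrink the step size by a constant factor every `every` epochs.
    return lr0 * factor ** (epoch // every)

def dev_decay(lr, dev_history, factor=0.9):
    # dev-decay: shrink the step size whenever the latest dev result
    # fails to improve on the best value seen so far.
    if len(dev_history) > 1 and dev_history[-1] <= max(dev_history[:-1]):
        return lr * factor
    return lr

print(step_size_grid(0.1))                 # five log-spaced candidates
print(fixed_decay(0.1, epoch=60))          # decayed twice: 0.1 * 0.1**2
print(dev_decay(0.1, [0.70, 0.75, 0.74]))  # no new best -> step size shrinks
```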
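The Dataset Splits row describes how a setting is selected within the fixed epoch budget: best peak development-set performance when a dev set exists, otherwise (CIFAR-10) lowest final training loss. A minimal sketch of that selection rule, assuming a hypothetical `runs` dictionary of per-epoch metrics:

```python
# Hedged sketch of the selection rule: within a fixed epoch budget, pick the
# setting with the best *peak* dev performance; with no dev set, fall back to
# the lowest training loss at the end of the budget. The `runs` format is
# hypothetical.

def select_setting(runs, budget, has_dev_set=True):
    if has_dev_set:
        # Best peak dev accuracy reached within the budget.
        return max(runs, key=lambda s: max(runs[s]["dev_acc"][:budget]))
    # No dev set: lowest training loss at the end of the budget.
    return min(runs, key=lambda s: runs[s]["train_loss"][budget - 1])

runs = {
    "lr=0.1":  {"dev_acc": [0.60, 0.72, 0.75], "train_loss": [1.2, 0.8, 0.6]},
    "lr=0.01": {"dev_acc": [0.55, 0.70, 0.78], "train_loss": [1.5, 1.0, 0.7]},
}
print(select_setting(runs, budget=3))                      # 'lr=0.01' (peak dev acc)
print(select_setting(runs, budget=3, has_dev_set=False))   # 'lr=0.1' (lowest final loss)
```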