Aggregated Momentum: Stability Through Passive Damping

Authors: James Lucas, Shengyang Sun, Richard Zemel, Roger Grosse

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate AggMo empirically we compare against other commonly used optimizers on a range of deep learning architectures: deep autoencoders, convolutional networks, and long short-term memory (LSTM) networks.
Researcher Affiliation | Academia | James Lucas, Shengyang Sun, Richard Zemel, Roger Grosse; University of Toronto; Vector Institute; {jlucas, ssy, zemel, rgrosse}@cs.toronto.edu
Pseudocode | No | The paper describes the AggMo update rule using mathematical equations (Equation 3) but does not provide a formal pseudocode or algorithm block. (A sketch of this update rule follows the table.)
Open Source Code | No | The paper does not provide a link to open-source code or explicitly state that the code is publicly available.
Open Datasets | Yes | To do so we used four datasets: MNIST (LeCun et al., 1998), CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009) and Penn Treebank (Marcus et al., 1993).
Dataset Splits | Yes | For these experiments the training set consists of 90% of the training data, with the remaining 10% being used for validation.
Hardware Specification | No | The paper mentions that experiments were conducted using the PyTorch library but does not specify any hardware details such as GPU or CPU models.
Software Dependencies | No | All of our experiments are conducted using the PyTorch library (Paszke et al., 2017). The paper mentions PyTorch but does not provide a specific version number for it or for other software dependencies.
Experiment Setup | Yes | For CM and Nesterov we evaluated damping coefficients in the range {0.0, 0.9, 0.99, 0.999}. For Adam, it is standard to use β1 = 0.9 and β2 = 0.999. Since β1 is analogous to the momentum damping parameter, we considered β1 ∈ {0.9, 0.99, 0.999} and kept β2 = 0.999. For AggMo, we explored K ∈ {2, 3, 4}. Each model was trained for 1000 epochs. ... We train for a total of 1000 epochs using a multiplicative learning rate decay of 0.1 at epochs 200, 400, and 800. We train using batch sizes of 200. (A sketch of this schedule follows the table.)
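
Since the paper states the AggMo update only as equations (Equation 3; see the Pseudocode row above), the following is a minimal Python/NumPy sketch of that update rule as we read it: K velocity vectors, each damped by its own coefficient, with the parameter step taken as their average. The function and variable names (aggmo_step, velocities, betas) are illustrative rather than from the paper, and the example damping vector (0.0, 0.9, 0.99) reflects the scale-spaced choice the paper reports for K = 3.

```python
import numpy as np

def aggmo_step(params, velocities, grad, lr, betas):
    """One AggMo-style update: K damped velocities, averaged parameter step.

    Sketch of the update rule described in the paper (Equation 3);
    names and structure here are illustrative, not the authors' code.
    """
    K = len(betas)
    for i, beta in enumerate(betas):
        # Each velocity keeps its own damping coefficient.
        velocities[i] = beta * velocities[i] - grad
    # The parameter update averages the K velocities, scaled by the learning rate.
    params = params + (lr / K) * sum(velocities)
    return params, velocities

# Example with K = 3 and damping coefficients (0.0, 0.9, 0.99).
betas = [1.0 - 0.1 ** i for i in range(3)]
params = np.zeros(4)
velocities = [np.zeros_like(params) for _ in betas]
grad = np.ones_like(params)  # stand-in gradient for illustration
params, velocities = aggmo_step(params, velocities, grad, lr=0.1, betas=betas)
```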
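
For the training schedule quoted in the Experiment Setup row (1000 epochs, multiplicative learning rate decay of 0.1 at epochs 200, 400, and 800, batch size 200), here is a hedged PyTorch sketch. The toy model, random data, and the SGD-with-momentum optimizer are placeholders (AggMo itself is not part of torch.optim); only the epoch count, decay milestones, decay factor, and batch size come from the quoted text.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model stand in for the paper's autoencoder / ConvNet / LSTM experiments.
data = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
loader = DataLoader(data, batch_size=200)  # batch size 200, as in the quoted setup

model = nn.Linear(10, 1)
# Placeholder optimizer: plain SGD with momentum instead of AggMo.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Multiplicative decay of 0.1 at epochs 200, 400, and 800, as described in the setup.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 400, 800], gamma=0.1)

for epoch in range(1000):  # 1000 epochs total
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```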