Aggregated Momentum: Stability Through Passive Damping
Authors: James Lucas, Shengyang Sun, Richard Zemel, Roger Grosse
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate AggMo empirically we compare against other commonly used optimizers on a range of deep learning architectures: deep autoencoders, convolutional networks, and long short-term memory (LSTM) networks. |
| Researcher Affiliation | Academia | James Lucas, Shengyang Sun, Richard Zemel, Roger Grosse; University of Toronto; Vector Institute; {jlucas, ssy, zemel, rgrosse}@cs.toronto.edu |
| Pseudocode | No | The paper describes the AggMo update rule using mathematical equations (Equation 3) but does not provide a formal pseudocode or algorithm block; a minimal sketch of the update rule is given after the table. |
| Open Source Code | No | The paper does not provide a link to open-source code or explicitly state that the code is publicly available. |
| Open Datasets | Yes | To do so we used four datasets: MNIST (LeCun et al., 1998), CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009) and Penn Treebank (Marcus et al., 1993). |
| Dataset Splits | Yes | For these experiments the training set consists of 90% of the training data with the remaining 10% being used for validation. |
| Hardware Specification | No | The paper mentions that experiments were conducted using the PyTorch library but does not specify any hardware details like GPU or CPU models. |
| Software Dependencies | No | All of our experiments are conducted using the PyTorch library (Paszke et al., 2017). The paper mentions PyTorch but does not provide a specific version number for it or for other software dependencies. |
| Experiment Setup | Yes | For CM and Nesterov we evaluated damping coefficients in the range {0.0, 0.9, 0.99, 0.999}. For Adam, it is standard to use β1 = 0.9 and β2 = 0.999. Since β1 is analogous to the momentum damping parameter, we considered β1 ∈ {0.9, 0.99, 0.999} and kept β2 = 0.999. For AggMo, we explored K ∈ {2, 3, 4}. Each model was trained for 1000 epochs. ... We train for a total of 1000 epochs using a multiplicative learning rate decay of 0.1 at 200, 400, and 800 epochs. We train using batch sizes of 200. (A sketch of this training schedule appears after the table.) |
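
Since the paper expresses AggMo only as equations (Equation 3), the following is a minimal Python/NumPy sketch of the update rule as we read it: K velocity vectors, each with its own damping coefficient, are driven by the same gradient, and the parameter step averages them. Function and variable names are ours, not the authors'.

```python
import numpy as np

def aggmo_step(theta, velocities, grad, betas, lr):
    """One AggMo update (sketch of Equation 3): K damped velocities, averaged step."""
    K = len(betas)
    for i, beta in enumerate(betas):
        # Each velocity keeps its own damping coefficient but sees the same gradient.
        velocities[i] = beta * velocities[i] - grad
    # The parameter step averages the K velocities, scaled by the learning rate.
    theta = theta + (lr / K) * sum(velocities)
    return theta, velocities

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.ones(5)
betas = [0.0, 0.9, 0.99]                              # K = 3 damping coefficients
velocities = [np.zeros_like(theta) for _ in betas]
for _ in range(100):
    theta, velocities = aggmo_step(theta, velocities, grad=theta.copy(),
                                   betas=betas, lr=0.1)
```

The quoted experiment setup (1000 epochs, batch size 200, multiplicative learning-rate decay of 0.1 at epochs 200, 400, and 800) maps onto a standard PyTorch schedule. The sketch below assumes plain SGD with momentum as a stand-in optimizer (AggMo is not part of torch.optim) and a hypothetical model and initial learning rate; only the decay schedule itself mirrors the paper's description.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(784, 10)                           # hypothetical model
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)  # stand-in optimizer; lr assumed

# Multiplicative decay of 0.1 at epochs 200, 400, and 800, trained for 1000 epochs.
scheduler = MultiStepLR(optimizer, milestones=[200, 400, 800], gamma=0.1)

for epoch in range(1000):
    # ... one pass over the training set with batch size 200, calling optimizer.step() ...
    scheduler.step()
```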