Reducing the variance in online optimization by transporting past gradients
Authors: Sébastien Arnold, Pierre-Antoine Manzagol, Reza Babanezhad Harikandeh, Ioannis Mitliagkas, Nicolas Le Roux
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal. |
| Researcher Affiliation | Collaboration | Sébastien M. R. Arnold, University of Southern California, Los Angeles, CA (seb.arnold@usc.edu); Pierre-Antoine Manzagol, Google Brain, Montréal, QC (manzagop@google.com); Reza Babanezhad, University of British Columbia, Vancouver, BC (rezababa@cs.ubc.ca); Ioannis Mitliagkas, Mila, Université de Montréal, Montréal, QC (ioannis@iro.umontreal.ca); Nicolas Le Roux, Mila, Google Brain, Montréal, QC (nlr@google.com) |
| Pseudocode | Yes | Algorithm 1 Heavyball-IGT |
| Open Source Code | Yes | Open-source implementation available at: https://github.com/seba-1511/igt.pth |
| Open Datasets | Yes | CIFAR-10 image classification: We first consider the task of training a ResNet-56 model [12] on the CIFAR-10 image classification dataset [19]. ... ImageNet image classification: We also consider the task of training a ResNet-50 model [12] on the larger ImageNet dataset [36]. ... IMDb sentiment analysis: We train a bi-directional LSTM on the IMDb Large Movie Review Dataset [27] for 200 epochs. ... Mini-Imagenet dataset [34]. |
| Dataset Splits | No | The paper mentions using validation sets and reports 'validation accuracies', but does not provide specific details on the train/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running experiments, such as GPU models, CPU types, or cloud computing instance details. |
| Software Dependencies | No | The paper mentions 'We used TF official models code and setup [1]' but does not provide specific version numbers for TensorFlow or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | We tuned the step size for each algorithm by running experiments using a logarithmic grid. ... We used a linearly decreasing stepsize as it was shown to be simple and perform well [43]. ... For each optimizer we selected the hyperparameter combination that is fastest to reach a consistently attainable target train loss [43]. ... we trained using larger minibatches (1024 instead of 128). ... We train a bi-directional LSTM on the IMDb Large Movie Review Dataset for 200 epochs. ... We replicate the 5 ways classification setup with 5 adaptation steps on tasks from the Mini-Imagenet dataset [34]. ... select the stepsize that maximizes the validation accuracy after 10K iterations, and use it to train the model for 100K iterations. |
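For readers checking the pseudocode claim above, the Heavyball-IGT update (Algorithm 1 in the paper) can be sketched in a few lines: the stochastic gradient is evaluated at a "transported" extrapolation point, averaged into a running estimate with weight `gamma_t = t / (t + 1)`, and then fed to a heavy-ball step. The NumPy version below is an illustrative sketch on a hypothetical noisy quadratic, not the authors' PyTorch implementation (see the `igt.pth` repository for that); the toy objective, function names, and hyperparameter values are assumptions made for the example.

```python
import numpy as np

def noisy_grad(theta, target, rng, noise=0.1):
    """Stochastic gradient of the toy objective 0.5*||theta - target||^2
    (an illustrative stand-in for a minibatch gradient)."""
    return (theta - target) + noise * rng.standard_normal(theta.shape)

def heavyball_igt(target, steps=500, lr=0.1, momentum=0.9, seed=0):
    """Sketch of Heavyball-IGT on the toy quadratic above."""
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(target)
    theta_prev = theta.copy()
    v = noisy_grad(theta, target, rng)       # IGT gradient estimate
    w = -lr * v                              # heavy-ball velocity
    theta, theta_prev = theta + w, theta
    for t in range(1, steps):
        gamma = t / (t + 1.0)
        # Transport: evaluate the gradient at the extrapolated point
        # theta + gamma/(1-gamma) * (theta - theta_prev), then average.
        shifted = theta + (gamma / (1.0 - gamma)) * (theta - theta_prev)
        v = gamma * v + (1.0 - gamma) * noisy_grad(shifted, target, rng)
        w = momentum * w - lr * v
        theta, theta_prev = theta + w, theta
    return theta

target = np.array([1.0, -2.0, 3.0])
est = heavyball_igt(target)  # should land near `target` as noise averages out
```

The extrapolation factor `gamma/(1-gamma)` equals `t`, which is what lets the uniform average of past gradients act like a gradient at the current iterate in the equal-Hessian setting the paper analyzes.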