Memory Augmented Optimizers for Deep Learning

Authors: Paul-Aymeric Martin McRae, Prasanna Parthasarathi, Mido Assran, Sarath Chandar

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that the memory augmented extensions of standard optimizers enjoy accelerated convergence and improved performance on a majority of computer vision and language tasks that we considered. Additionally, we prove that the proposed class of optimizers with fixed-size memory converge under assumptions of strong convexity, regardless of which gradients are selected or how they are linearly combined to form the update step.
Researcher Affiliation Academia Paul-Aymeric McRae 1, Prasanna Parthasarathi 1,2, Mahmoud Assran 1,2, and Sarath Chandar 1,3,4. 1 Mila - Quebec AI Institute, Canada; 2 McGill University, Canada; 3 École Polytechnique de Montréal, Canada; 4 Canada CIFAR AI Chair
Pseudocode Yes Algorithm 1 Critical Gradients Optimization
Open Source Code Yes Code to reproduce results: https://github.com/chandar-lab/CriticalGradientOptimization. Optimizer package: https://github.com/chandar-lab/CGOptimizer
Open Datasets Yes To understand the performance on common deep learning datasets, we experiment with shallow/deep convolutional neural network architectures (CO) on CIFAR-10/100 (Krizhevsky et al., 2009) respectively; Bi-LSTM (BL), InferSent (I), and a text-based convolutional architecture (C) on the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) dataset; LSTM on word-level language modeling (LS) with the Penn Treebank (Marcus et al., 1993) and WikiText (Merity et al., 2016) datasets; RoBERTa-Base (RoB) and BiLSTM (BL) on the language generation in dialogue task with the MultiWOZ 2.0 dataset (Budzianowski et al., 2018). Additionally, we perform analysis experiments using logistic regression (LR) and multi-layer perceptrons (MLP) on the MNIST digit classification dataset (LeCun and Cortes, 2010).
Dataset Splits Yes Table 1 details the train-valid-test splits of all the datasets used in the experiments.

Dataset        #Train       #Valid      #Test
covtype        5000         N/A         N/A
rcv1           5000         N/A         N/A
MNIST          50K          10K         10K
CIFAR-10       40K          10K         10K
CIFAR-100      40K          10K         10K
SNLI           550K         10K         10K
WikiText       2M           213K        241K
Penn Treebank  890K         70K         78K
MultiWOZ       115K (1.5M)  20K (200K)  20K (200K)
Hardware Specification Yes Description of the computing infrastructure used: We used 50 NVIDIA V100 32GB GPUs in parallel for the hyperparameter search over the grid, using the wandb and submitit packages. For the final runs we used 1 NVIDIA V100 32GB GPU for every seed of every model.
Software Dependencies Yes We use PyTorch 1.1 for the experiments and its implementations of the base optimizers available in torch.optim.
Experiment Setup Yes Our experimental results are aggregated from 5 independent runs, with the hyperparameters for each optimizer extensively tuned. This involves tuning the learning rate in all optimizers, the topC and decay parameters in all _C algorithms, and all optimizer-specific hyperparameters in both the vanilla versions and their _C counterparts. A full description of the values used to tune the various parameters, architecture, and dataset details are in Appendices E, D, and C respectively.
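To make the memory-augmented idea concrete, here is a minimal, heavily simplified sketch of a "critical gradients" SGD step: keep a fixed-size buffer (size topC) of the largest-magnitude gradients seen so far, decay stored priorities so stale entries can be evicted, and mix the buffer's mean into each update. This is not the authors' implementation (see the linked repositories for that); the class name, the priority/decay scheme, and the plain averaging of buffered gradients are illustrative assumptions, shown here on a scalar toy problem rather than on tensors.

```python
import heapq
import itertools


class CriticalGradientSGD:
    """Plain SGD plus a fixed-size buffer of the largest-magnitude
    gradients seen so far; each step mixes the buffer's mean into the
    update. A sketch under assumptions, not the paper's algorithm."""

    def __init__(self, lr=0.1, topC=5, decay=0.7):
        self.lr, self.topC, self.decay = lr, topC, decay
        self.buffer = []               # min-heap of (priority, id, gradient)
        self._ids = itertools.count()  # tie-breaker for heap comparisons

    def step(self, w, grad):
        # Decay stored priorities so stale gradients can be evicted.
        self.buffer = [(p * self.decay, i, g) for p, i, g in self.buffer]
        heapq.heapify(self.buffer)

        # Admit the new gradient if the buffer has room, or if its
        # magnitude beats the smallest stored (decayed) priority.
        entry = (abs(grad), next(self._ids), grad)
        if len(self.buffer) < self.topC:
            heapq.heappush(self.buffer, entry)
        elif entry[0] > self.buffer[0][0]:
            heapq.heapreplace(self.buffer, entry)

        # Update direction: current gradient plus the buffer's mean.
        critical = sum(g for _, _, g in self.buffer) / len(self.buffer)
        return w - self.lr * (grad + critical)


# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
opt = CriticalGradientSGD(lr=0.05, topC=3, decay=0.7)
w = 0.0
for _ in range(200):
    w = opt.step(w, 2.0 * (w - 3.0))
print(f"w after 200 steps: {w:.3f}")
```

Note how the memory stays at topC entries regardless of how many gradients are seen, which matches the fixed-size-memory setting of the paper's convergence claim; the specific decay value here just controls how quickly stale gradients lose their buffer slot.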
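The tuning procedure described in the last row (learning rate everywhere, plus topC and decay for the _C variants) amounts to a grid search keeping the best validation score. The authors ran this in parallel with wandb and submitit; the serial stand-in below is a sketch, with a dummy `validate` function and grid values that are illustrative, not the paper's.

```python
import itertools

def validate(lr, topC, decay):
    # Stand-in for training a model and returning a validation loss;
    # a real sweep would train with these hyperparameters instead.
    return (lr - 0.01) ** 2 + 0.001 * topC + (1 - decay) * 0.01

# Illustrative grid, not the values from the paper's appendices.
grid = {
    "lr": [0.1, 0.01, 0.001],
    "topC": [5, 10, 20],
    "decay": [0.9, 0.99],
}

# Enumerate every combination and keep the lowest validation loss.
best = min(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda cfg: validate(**cfg),
)
print(best)
```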