Memory Augmented Optimizers for Deep Learning
Authors: Paul-Aymeric Martin McRae, Prasanna Parthasarathi, Mido Assran, Sarath Chandar
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the memory augmented extensions of standard optimizers enjoy accelerated convergence and improved performance on a majority of computer vision and language tasks that we considered. Additionally, we prove that the proposed class of optimizers with fixed-size memory converge under assumptions of strong convexity, regardless of which gradients are selected or how they are linearly combined to form the update step. |
| Researcher Affiliation | Academia | Paul-Aymeric McRae 1, Prasanna Parthasarathi 1,2, Mahmoud Assran 1,2, and Sarath Chandar 1,3,4; 1 Mila Quebec AI Institute, Canada; 2 McGill University, Canada; 3 École Polytechnique de Montréal, Canada; 4 Canada CIFAR AI Chair |
| Pseudocode | Yes | Algorithm 1: Critical Gradients Optimization (a hedged sketch of such a memory-augmented optimizer is given after the table) |
| Open Source Code | Yes | Code to reproduce results: https://github.com/chandar-lab/CriticalGradientOptimization. Optimizer package: https://github.com/chandar-lab/CGOptimizer |
| Open Datasets | Yes | To understand the performance on common deep learning datasets, we experiment with shallow/deep convolutional neural network architectures (CO) on CIFAR-10/100 (Krizhevsky et al., 2009) respectively; Bi-LSTM (BL), InferSent (I), and a text-based convolutional architecture (C) on the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) dataset; LSTM on word-level language modeling (LS) with the Penn Tree Bank (Marcus et al., 1993) and WikiText (Merity et al., 2016) datasets; RoBERTa-Base (RoB) and BiLSTM (BL) on the language generation in dialogue task with the MultiWoZ 2.0 dataset (Budzianowski et al., 2018). Additionally, we perform analysis experiments using logistic regression (LR) and multi-layer perceptrons (MLP) on the MNIST digit classification dataset (LeCun and Cortes, 2010). |
| Dataset Splits | Yes | Table 1 details the train/valid/test splits of all the datasets used in the experiments: covtype 5000 / N/A / N/A; rcv1 5000 / N/A / N/A; MNIST 50K / 10K / 10K; CIFAR-10 40K / 10K / 10K; CIFAR-100 40K / 10K / 10K; SNLI 550K / 10K / 10K; WikiText 2M / 213K / 241K; Penn Tree Bank 890K / 70K / 78K; MultiWoZ 115K (1.5M) / 20K (200K) / 20K (200K). |
| Hardware Specification | Yes | Description of the computing infrastructure used: We used 50 NVIDIA V100 32GB GPUs in parallel for the hyperparameter search over the grid, using the wandb and submitit packages (a sketch of such a submitit sweep follows the table). For the final runs we used 1 NVIDIA V100 32GB GPU for every seed of every model. |
| Software Dependencies | Yes | We use PyTorch 1.1 for the experiments and use its implementation of the base optimizers available in torch.optim. |
| Experiment Setup | Yes | Our experimental results are aggregated from 5 independent runs, with the hyperparameters for each optimizer extensively tuned (a sketch of the seed aggregation follows the table). This involves tuning the learning rate in all optimizers, the topC and decay parameters in all _C algorithms, and all optimizer-specific hyperparameters in both the vanilla versions and their _C counterparts. A full description of the values used to tune the various parameters, the architectures, and the datasets is given in Appendices E, D, and C, respectively. |
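
The pseudocode row above refers to Algorithm 1 (Critical Gradients Optimization). Below is a minimal, hedged sketch of what such a memory-augmented SGD variant could look like: it keeps a fixed-size buffer of the largest-norm gradients seen so far, decays the stored norms each step, and adds the buffer's mean to the current gradient. Only the `topC` and `decay` hyperparameter names come from the paper; the class name `SGD_C`, the mean aggregation, and the eviction rule are illustrative assumptions, not the authors' implementation (see the released CGOptimizer package for that).

```python
import torch
from torch.optim import Optimizer


class SGD_C(Optimizer):
    """Illustrative sketch of SGD with a fixed-size critical-gradient memory.

    NOTE: an assumption-laden reconstruction, not the authors' code.
    """

    def __init__(self, params, lr=0.01, topC=5, decay=0.7):
        defaults = dict(lr=lr, topC=topC, decay=decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, topC, decay = group["lr"], group["topC"], group["decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad.detach()
                buf = self.state[p].setdefault("critical", [])  # [(norm, grad), ...]
                # Decay stored norms so stale gradients eventually get evicted.
                buf[:] = [(n * decay, cg) for n, cg in buf]
                norm = g.norm().item()
                if len(buf) < topC:
                    buf.append((norm, g.clone()))
                else:
                    # Replace the smallest-norm entry if the new gradient beats it.
                    i_min = min(range(len(buf)), key=lambda i: buf[i][0])
                    if norm > buf[i_min][0]:
                        buf[i_min] = (norm, g.clone())
                # Combine the current gradient with the aggregated memory
                # (a simple mean here; the paper's analysis allows any linear combination).
                mem = torch.stack([cg for _, cg in buf]).mean(dim=0)
                p.add_(g + mem, alpha=-lr)
        return loss
```

Usage mirrors any torch.optim optimizer, e.g. `opt = SGD_C(model.parameters(), lr=0.1, topC=10, decay=0.7)`, then the usual `opt.step()` after `loss.backward()`.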
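The hardware row mentions running the hyperparameter grid search across 50 V100 GPUs with submitit (and wandb for tracking). A minimal sketch of dispatching such a sweep with submitit is below; the `train_and_eval` function, the grid values, and the resource settings are placeholders, not the authors' actual configuration (which is described in the paper's Appendix E).

```python
import itertools
import submitit


def train_and_eval(lr, topC, decay):
    """Placeholder training entry point; would train one model and return
    its validation score (wandb logging omitted for brevity)."""
    return 0.0  # dummy value so the sketch runs end to end


# Hypothetical grid; the real search space is given in the paper's Appendix E.
grid = list(itertools.product([1e-3, 1e-2, 1e-1], [5, 10, 20], [0.7, 0.9, 0.99]))

executor = submitit.AutoExecutor(folder="logs/sweep")
executor.update_parameters(timeout_min=720, gpus_per_node=1, cpus_per_task=4)

# One job per grid point; the cluster scheduler fans these out across GPUs.
jobs = [executor.submit(train_and_eval, lr, c, d) for lr, c, d in grid]
best_score, best_config = max(zip((job.result() for job in jobs), grid))
print("best config:", best_config, "score:", best_score)
```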
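Finally, the experiment-setup row states that reported results are aggregated over 5 independent runs. A small hedged sketch of that aggregation step follows; `run_one_seed` is a stand-in for a full training and evaluation run.

```python
import statistics
import torch


def run_one_seed(seed):
    """Stand-in for one full training/evaluation run with a fixed seed."""
    torch.manual_seed(seed)
    # ... build model, train with the chosen optimizer, evaluate ...
    return 0.0  # placeholder test metric


scores = [run_one_seed(seed) for seed in range(5)]
print(f"mean = {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```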