On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Authors: Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora

NeurIPS 2022

Reproducibility Assessment
Each entry below lists the variable, its result, and the supporting LLM response.

Research Type: Experimental
LLM Response: A key practical result is the derivation of a square root scaling rule for adjusting the optimization hyperparameters of RMSprop and Adam when the batch size changes, together with its empirical validation in deep learning settings.

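As an illustration of such a rule, here is a minimal sketch in Python. It assumes the commonly stated form of the rule (batch size multiplied by kappa implies learning rate multiplied by sqrt(kappa), 1 - beta1 and 1 - beta2 multiplied by kappa, and epsilon divided by sqrt(kappa)); the helper name and interface are hypothetical and not taken from the paper's released code.

    import math

    def sqrt_scale_adam_hparams(lr, beta1, beta2, eps, kappa):
        """Rescale Adam/RMSprop hyperparameters when the batch size is
        multiplied by kappa (illustrative sketch, not the paper's code).

        Assumed rule: lr -> sqrt(kappa) * lr,
        1 - beta -> kappa * (1 - beta) for both betas,
        eps -> eps / sqrt(kappa).
        """
        return {
            "lr": math.sqrt(kappa) * lr,
            "beta1": 1.0 - kappa * (1.0 - beta1),  # valid only while kappa * (1 - beta1) < 1
            "beta2": 1.0 - kappa * (1.0 - beta2),
            "eps": eps / math.sqrt(kappa),
        }

    # Example: doubling the batch size (kappa = 2) from a standard Adam config.
    print(sqrt_scale_adam_hparams(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, kappa=2))
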
Researcher Affiliation: Academia
LLM Response: Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora; Department of Computer Science, Princeton University; {smalladi,klyu,ap34,arora}@cs.princeton.edu

Pseudocode: No
LLM Response: The paper describes algorithms such as RMSprop, Adam, and SVAG, but does not present them in a formal 'Pseudocode' or 'Algorithm' block.

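For orientation, the following sketch gives the standard textbook update rules for RMSprop and Adam in Python; it is not reproduced from the paper, and SVAG (a noise-amplified variant of SGD used to validate SDE approximations) is omitted.

    import numpy as np

    def rmsprop_step(x, grad, v, lr=1e-3, beta2=0.999, eps=1e-8):
        # Exponential moving average of squared gradients.
        v = beta2 * v + (1.0 - beta2) * grad**2
        # Preconditioned gradient step.
        x = x - lr * grad / (np.sqrt(v) + eps)
        return x, v

    def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Moving averages of the gradient and the squared gradient.
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad**2
        # Bias correction for the zero-initialized moments (t counts steps from 1).
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
        return x, m, v
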
Open Source Code: Yes
LLM Response: "We include the code for the vision experiments in the supplementary material. For the NLP experiments, we use the code of Wettig et al. (2022)."

Open Datasets: Yes
LLM Response: Figures 1 and 2 show the square root scaling rule applied to ResNet-50 (He et al., 2016) and VGG-16 (Simonyan and Zisserman, 2014) trained on CIFAR-10 (Krizhevsky et al.), RoBERTa-large (Liu et al., 2019) trained on the Wiki+Books corpus (Zhu et al., 2015), a 12-layer GPT (Brown et al., 2020) trained on WikiText-103 (Merity et al., 2017), and ResNet-50 trained on ImageNet (Deng et al., 2009).

Dataset Splits: No
LLM Response: The paper mentions 'Test Accuracy' and 'Validation Log Perplexity' but does not explicitly state the dataset split percentages or the methodology used to create train/validation/test splits.

Hardware Specification: Yes
LLM Response: "We ran our experiments on a cluster of 34 GPUs, where 24 are RTX 2080 GPUs and 10 are A5000 GPUs. Each experiment on CIFAR-10 required a single RTX 2080 GPU, each experiment on ImageNet required a single A5000 GPU, each pretraining experiment on GPT required a set of 4 RTX 2080 GPUs, each pretraining experiment on RoBERTa required a set of 8 RTX 2080 GPUs, and each finetuning experiment on RoBERTa required a single RTX 2080 GPU."

Software Dependencies: No
LLM Response: The paper mentions using the code of Wettig et al. (2022) but does not specify versions of software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or Python.

Experiment Setup: No
LLM Response: The main text defers details to the appendix, stating only that "Appendix J contains the training details of all the experiments."