On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Authors: Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings. |
| Researcher Affiliation | Academia | Sadhika Malladi Kaifeng Lyu Abhishek Panigrahi Sanjeev Arora Department of Computer Science Princeton University {smalladi,klyu,ap34,arora}@cs.princeton.edu |
| Pseudocode | No | The paper describes algorithms like RMSprop, Adam, and SVAG, but does not present them in a formalized 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | We include the code for the vision experiments in the supplementary material. For the NLP experiments, we use the code of Wettig et al. (2022). |
| Open Datasets | Yes | Figures 1 and 2 show the square root scaling rule applied to ResNet-50 (He et al., 2016) and VGG-16 (Simonyan and Zisserman, 2014) trained on CIFAR-10 (Krizhevsky et al.), RoBERTa-large (Liu et al., 2019) trained on the Wiki+Books corpus (Zhu et al., 2015), 12-layer GPT (Brown et al., 2020) on WikiText-103 (Merity et al., 2017), and ResNet-50 trained on ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper mentions 'Test Accuracy' and 'Validation Log Perplexity' but does not explicitly state the dataset split percentages or specific methodology used for creating train/validation/test splits. |
| Hardware Specification | Yes | We ran our experiments on a cluster of 34 GPUs, where 24 are RTX 2080 GPUs and 10 are A5000 GPUs. Each experiment on CIFAR-10 required a single RTX 2080 GPU, each experiment on ImageNet required a single A5000 GPU, each pretraining experiment on GPT required a set of 4 RTX 2080 GPUs, each pretraining experiment on RoBERTa required a set of 8 RTX 2080 GPUs, and each finetuning experiment on RoBERTa required a single RTX 2080 GPU. |
| Software Dependencies | No | The paper mentions using the code of Wettig et al. (2022) but does not specify the versions of software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow) or Python. |
| Experiment Setup | No | Appendix J contains the training details of all the experiments. |
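
The square root scaling rule mentioned in the Research Type row above prescribes how to adjust the hyperparameters of RMSprop and Adam when the batch size changes. The snippet below is a minimal illustrative sketch, not the authors' released code, assuming the rule scales the learning rate by √κ, the quantities 1−β1 and 1−β2 by κ, and ε by 1/√κ when the batch size is multiplied by κ; the baseline hyperparameter values in the usage example are hypothetical placeholders.

```python
# Illustrative sketch of a square-root scaling rule for Adam/RMSprop hyperparameters
# when the batch size is multiplied by a factor kappa (assumed form of the rule):
#   lr      -> sqrt(kappa) * lr
#   1 - b1  -> kappa * (1 - b1)
#   1 - b2  -> kappa * (1 - b2)
#   eps     -> eps / sqrt(kappa)
import math

def sqrt_scale_adam_hparams(lr, beta1, beta2, eps, kappa):
    """Return Adam hyperparameters rescaled for a batch size multiplied by `kappa`."""
    return {
        "lr": lr * math.sqrt(kappa),
        "beta1": 1.0 - kappa * (1.0 - beta1),
        "beta2": 1.0 - kappa * (1.0 - beta2),
        "eps": eps / math.sqrt(kappa),
    }

# Example: scaling from batch size 256 to 1024 (kappa = 4) with placeholder base values.
base = {"lr": 1e-3, "beta1": 0.9, "beta2": 0.999, "eps": 1e-8}
scaled = sqrt_scale_adam_hparams(**base, kappa=1024 / 256)
print(scaled)  # lr = 2e-3, beta1 = 0.6, beta2 = 0.996, eps = 5e-9
```

The exact hyperparameter adjustments and their empirical validation are detailed in the paper (Figures 1 and 2); the sketch above only illustrates the general shape of such a rule.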