A Computationally Efficient Sparsified Online Newton Method

Authors: Fnu Devvrit, Sai Surya Duvvuri, Rohan Anil, Vineet Gupta, Cho-Jui Hsieh, Inderjit Dhillon

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods.
Researcher Affiliation | Collaboration | Devvrit, Department of Computer Science, The University of Texas at Austin, devvrit.03@gmail.com; Sai Surya Duvvuri, Department of Computer Science, The University of Texas at Austin, subramanyamdvss@gmail.com; Rohan Anil, Google DeepMind, rohananil@google.com; Vineet Gupta, Google, vineet@google.com; Cho-Jui Hsieh, CS Department, UCLA & Google, chohsieh@cs.ucla.edu; Inderjit Dhillon, Google, isd@google.com
Pseudocode | Yes | Algorithm 1: Sparsified Online Newton (SONew) Algorithm
Open Source Code | Yes | SONew code is available at: https://github.com/devvrit/SONew
Open Datasets | Yes | We use three sparsity patterns for SONew: a) diagonal sparsity, resulting in a diagonal preconditioner similar to adaptive first order methods like Adam and Adagrad; b) tridiagonal sparsity, corresponding to a chain graph; and c) banded sparsity, represented by 'band-k' in tables and figures for band size of k. We compare SONew against widely used first order methods including SGD [32], SGD with Momentum [41], Nesterov [40], Adagrad [14], Adam [33], and RMSProp [48]. We also compare with rfdSON [37], a recently proposed memory-efficient second order optimizer, and with Shampoo [24], a state-of-the-art second-order optimizer used in practice, albeit with considerable memory and time requirements. (Illustrative sketches of these sparsity patterns and of the diagonal special case appear after this table.)
Dataset Splits | Yes | We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods.
Hardware Specification | Yes | Each experiment is performed using one V100 GPU having 16 GB memory. [...] We search over 200 hyperparameters using four 16 GB TPUs (v2) for each run. In order to conduct a fair comparison of the running times, we executed the optimal hyperparameter configurations on four 32 GB TPUs (v4) [31]. [...] All experiments were performed on 16 TPU v4s.
Software Dependencies | No | The paper mentions a 'JAX implementation' and 'PyTorch implementations' but does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | We search over 200 hyperparameters using four 16 GB TPUs (v2) for each run. The search ranges are: first order momentum term β1 ∈ [1e-1, 0.999], second order momentum term β2 ∈ [1e-1, 0.999], learning rate ∈ [1e-7, 1e-1], ϵ ∈ [1e-10, 1e-1]. We give the optimal hyperparameter value for each experiment in Table 12. For the ViT and Graph Network benchmarks, we search β1, β2 ∈ [0.1, 0.999], learning rate ∈ [1e-5, 1e-1], ϵ ∈ [1e-9, 1e-4], weight decay ∈ [1e-5, 1.0], learning rate warmup ∈ {2%, 5%, 10%} of total_train_steps, dropout ∈ {0.0, 0.1}, and label smoothing over {0.0, 0.1, 0.2}. We use a cosine learning rate schedule. Batch size was 1024 for the Vision Transformer and 512 for the Graph Network. (An illustrative sketch of this tuning setup appears after the table.)
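
For illustration, the three sparsity patterns quoted in the 'Open Datasets' row can be written as masks on the n × n preconditioner, with band-0 giving the diagonal pattern and band-1 the tridiagonal (chain-graph) pattern. The sketch below is not taken from the paper's released code, and the exact band-k indexing convention (keeping entries with |i − j| ≤ k) is an assumption.

```python
import jax.numpy as jnp

def band_mask(n: int, k: int) -> jnp.ndarray:
    """Boolean n x n mask keeping entries (i, j) with |i - j| <= k (assumed convention)."""
    idx = jnp.arange(n)
    return jnp.abs(idx[:, None] - idx[None, :]) <= k

diag_mask    = band_mask(6, 0)  # a) diagonal sparsity
tridiag_mask = band_mask(6, 1)  # b) tridiagonal / chain-graph sparsity
band4_mask   = band_mask(6, 4)  # c) 'band-4' sparsity
```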
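The same row notes that diagonal sparsity yields a preconditioner similar to adaptive first order methods such as Adam and Adagrad. The following is a minimal sketch of that diagonal special case only; it is not the paper's Algorithm 1, and the exact momentum and ϵ handling (as well as the default values) are assumptions.

```python
import jax.numpy as jnp

def diag_precond_step(param, grad, m, v,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update for a single parameter tensor in the diagonal-sparsity case.

    m and v are running first- and second-moment statistics with the same shape
    as param; lr, beta1, beta2, and eps mirror the hyperparameters tuned in the
    paper, but the defaults here are illustrative only.
    """
    m = beta1 * m + (1.0 - beta1) * grad           # first order momentum (beta1)
    v = beta2 * v + (1.0 - beta2) * grad * grad    # diagonal second-moment statistic (beta2)
    new_param = param - lr * m / (jnp.sqrt(v) + eps)  # diagonal preconditioning
    return new_param, m, v
```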
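Finally, the 'Experiment Setup' row quotes the hyperparameter search ranges, the warmup fractions, and the use of a cosine learning rate schedule. A minimal tuning-setup sketch under those quoted values follows; the log-uniform sampling and the precise warmup/decay shape are assumptions not stated in the table.

```python
import math
import random

# Search ranges for the MLP benchmarks, as quoted in the 'Experiment Setup' row.
SEARCH_SPACE = {
    "beta1":         (1e-1, 0.999),
    "beta2":         (1e-1, 0.999),
    "learning_rate": (1e-7, 1e-1),
    "epsilon":       (1e-10, 1e-1),
}

def sample_config(space=SEARCH_SPACE):
    """Draw one hyperparameter configuration (log-uniform sampling is an assumption)."""
    return {name: math.exp(random.uniform(math.log(lo), math.log(hi)))
            for name, (lo, hi) in space.items()}

def warmup_cosine_lr(step, total_steps, peak_lr, warmup_frac=0.05):
    """Linear warmup over warmup_frac * total_steps, then cosine decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example: draw a configuration and query the learning rate at step 100 of 10,000.
cfg = sample_config()
lr_100 = warmup_cosine_lr(100, 10_000, cfg["learning_rate"], warmup_frac=0.05)
```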