A Computationally Efficient Sparsified Online Newton Method
Authors: Fnu Devvrit, Sai Surya Duvvuri, Rohan Anil, Vineet Gupta, Cho-Jui Hsieh, Inderjit Dhillon
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. |
| Researcher Affiliation | Collaboration | Devvrit, Department of Computer Science, The University of Texas at Austin, devvrit.03@gmail.com; Sai Surya Duvvuri, Department of Computer Science, The University of Texas at Austin, subramanyamdvss@gmail.com; Rohan Anil, Google DeepMind, rohananil@google.com; Vineet Gupta, Google, vineet@google.com; Cho-Jui Hsieh, CS Department, UCLA & Google, chohsieh@cs.ucla.edu; Inderjit Dhillon, Google, isd@google.com |
| Pseudocode | Yes | Algorithm 1 Sparsified Online Newton (SONew) Algorithm |
| Open Source Code | Yes | SONew code is available at: https://github.com/devvrit/SONew |
| Open Datasets | Yes | We use three sparsity patterns for SONew: a) diagonal sparsity, resulting in a diagonal preconditioner similar to adaptive first order methods like Adam and Adagrad; b) tridiagonal sparsity, corresponding to a chain graph; and c) banded sparsity, represented by 'band-k' in tables and figures for band size of k. We compare SONew against widely used first order methods including SGD [32], SGD with Momentum [41], Nesterov [40], Adagrad [14], Adam [33], and RMSprop [48]. We also compare with rfdSON [37], a recently proposed memory-efficient second-order optimizer, and with Shampoo [24], a state-of-the-art second-order optimizer used in practice, albeit with considerable memory and time requirements. (A hedged sketch of the diagonal-sparsity case appears after the table.) |
| Dataset Splits | Yes | We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. |
| Hardware Specification | Yes | each experiment is performed using one V100 GPU having 16 GB memory. [...] We search over 200 hyperparameters using 4 16 GB TPUs (v2) for each run. In order to conduct a fair comparison of the running times, we executed the optimal hyperparameter configurations on 4 32GB TPUs (v4) [31]. [...] All experiments were performed on 16 TPU v4s. |
| Software Dependencies | No | The paper mentions 'JAX implementation' and 'Pytorch implementations' but does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We search over 200 hyperparameters using 4 16 GB TPUs (v2) for each run. The search ranges are: first order momentum term β1 ∈ [1e-1, 0.999], second order momentum term β2 ∈ [1e-1, 0.999], learning rate ∈ [1e-7, 1e-1], ϵ ∈ [1e-10, 1e-1]. We give the optimal hyperparameter value for each experiment in Table 12. For the ViT and Graph Network benchmarks, we search β1, β2 ∈ [0.1, 0.999], lr ∈ [1e-5, 1e-1], ϵ ∈ [1e-9, 1e-4], weight decay ∈ [1e-5, 1.0], learning rate warmup ∈ {2%, 5%, 10%} of total_train_steps, dropout ∈ [0.0, 0.1], and label smoothing over {0.0, 0.1, 0.2}. We use a cosine learning rate schedule. Batch size was kept at 1024 for Vision Transformer and 512 for Graph Network. (A hedged transcription of these search ranges as a config dict appears after the table.) |
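
The "Open Datasets" row notes that diagonal sparsity reduces SONew's preconditioner to a per-coordinate (diagonal) one, behaving like adaptive first-order methods such as Adam and Adagrad. The JAX sketch below illustrates only that diagonal special case under that stated correspondence; the function name `diag_precond_step`, the hyperparameter defaults, and the toy loop are assumptions for illustration, not the authors' Algorithm 1 (the full tridiagonal and banded variants are in the released code at https://github.com/devvrit/SONew).

```python
# Hedged sketch: a diagonal (per-coordinate) preconditioner update in the
# spirit of the "diagonal sparsity" special case described in the table
# above, which the paper says behaves like Adam/Adagrad-style methods.
# Function name and defaults are illustrative assumptions, NOT the
# authors' Algorithm 1.
import jax
import jax.numpy as jnp


def diag_precond_step(w, v, grad, lr=1e-3, beta2=0.999, eps=1e-8):
    """One update with a diagonal preconditioner.

    w    : parameter vector
    v    : running second-moment estimate (same shape as w)
    grad : gradient at w
    """
    # Exponential moving average of squared gradients (second-order momentum).
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Diagonal preconditioning: scale each coordinate by 1 / (sqrt(v) + eps).
    w = w - lr * grad / (jnp.sqrt(v) + eps)
    return w, v


# Toy usage on a quadratic loss, just to show the update shape.
key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (8,))
v = jnp.zeros_like(w)
loss = lambda w: jnp.sum(w ** 2)
for _ in range(100):
    g = jax.grad(loss)(w)
    w, v = diag_precond_step(w, v, g)
```

Replacing the exponential moving average with a plain running sum (`v = v + grad ** 2`) turns the same update into an Adagrad-style accumulator, matching the comparison the row draws to adaptive first-order methods.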
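
For the "Experiment Setup" row, the quoted tuning ranges can be written down as a plain search-space config. This is a hedged transcription only: the dictionary keys and structure are assumptions, while the numeric ranges and choices are copied from the row above.

```python
# Hedged transcription of the hyperparameter search ranges quoted in the
# "Experiment Setup" row. Keys and structure are illustrative assumptions;
# the numeric ranges come from the table above.
search_space = {
    "beta1": (1e-1, 0.999),          # first-order momentum
    "beta2": (1e-1, 0.999),          # second-order momentum
    "learning_rate": (1e-7, 1e-1),
    "epsilon": (1e-10, 1e-1),
}

# ViT / Graph Network benchmarks use a slightly different space.
vit_graph_search_space = {
    "beta1": (0.1, 0.999),
    "beta2": (0.1, 0.999),
    "learning_rate": (1e-5, 1e-1),
    "epsilon": (1e-9, 1e-4),
    "weight_decay": (1e-5, 1.0),
    "warmup_fraction": [0.02, 0.05, 0.10],   # fraction of total_train_steps
    "dropout": (0.0, 0.1),
    "label_smoothing": [0.0, 0.1, 0.2],
    "lr_schedule": "cosine",
    "batch_size": {"vision_transformer": 1024, "graph_network": 512},
}
```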