Block Low-Rank Preconditioner with Shared Basis for Stochastic Optimization

Authors: Jui-Nan Yen, Sai Surya Duvvuri, Inderjit Dhillon, Cho-Jui Hsieh

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results on a deep autoencoder and a Transformer benchmark demonstrate that the proposed method outperforms first-order methods with slightly more time and memory usage, while also achieving competitive or superior performance compared to other second-order methods with less time and memory usage.
Researcher Affiliation | Collaboration | Jui-Nan Yen (UCLA, juinanyen@cs.ucla.edu); Sai Surya Duvvuri (UT Austin, saisurya@cs.utexas.edu); Inderjit S. Dhillon (Google and UT Austin, inderjit@cs.utexas.edu); Cho-Jui Hsieh (Google and UCLA, chohsieh@cs.ucla.edu)
Pseudocode | Yes | Algorithm 1: Shared-Basis Low Rank Block-Diagonal Adagrad (an illustrative sketch of this style of preconditioner follows the table).
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the work described in this paper, nor does it provide a direct link to a source-code repository for their methodology.
Open Datasets | Yes | We evaluate the performance using a standard Autoencoder benchmark [27] on the MNIST dataset [8] and a larger Transformer model [30] on the Universal Dependencies dataset [24].
Dataset Splits | No | The paper refers to 'validation performance' and 'validation error' and uses standard benchmarks, but it does not explicitly state the training/validation/test splits (e.g., percentages or sample counts) needed for reproduction.
Hardware Specification | Yes | For the autoencoder benchmark, we conduct 180 trials of random search on one NVIDIA RTX 2080Ti GPU with 11GB memory. For the Transformer benchmark, we conduct 60 trials of random search on one NVIDIA RTX A6000 GPU with 48GB memory.
Software Dependencies | No | The paper mentions using the Google Flax repository but does not provide specific version numbers for Flax or any other key software components, libraries, or solvers.
Experiment Setup | Yes | We adopt k = 32 as the default rank for our methods. For randomized SVD, we set the oversampling parameter to 0 and the number of iterations to 1. Similar to Shampoo, we use the grafting technique [2] in our method. We set the grafting type to RMSPROP_NORMALIZED. The batch size is 1000. A linear warmup of 5 epochs is used for learning rate scheduling, followed by a linear decay to 0. (A sketch of the schedule and randomized SVD settings follows the table.)
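
The pseudocode row names Algorithm 1, "Shared-Basis Low Rank Block-Diagonal Adagrad". The snippet below is not the authors' Algorithm 1; it is a minimal sketch of the general idea, assuming each block's Adagrad statistic is approximated as U S_i U^T with a single orthonormal basis U shared across blocks. The function name, signature, and epsilon handling are illustrative assumptions.

```python
import jax.numpy as jnp

def shared_basis_precondition(g, U, S, eps=1e-6):
    """Apply (U S U^T + eps*I)^(-1/2) to one block gradient g.

    g : (d,)   gradient of a single parameter block
    U : (d, k) shared orthonormal basis (common to all blocks)
    S : (k, k) this block's accumulated second-moment statistic in that basis

    Because U has orthonormal columns, the inverse square root splits into a
    k-dimensional piece inside the basis and a scaled identity on the
    orthogonal complement:
      (U S U^T + eps*I)^(-1/2) g
        = U (S + eps*I)^(-1/2) (U^T g) + eps^(-1/2) (g - U U^T g)
    """
    c = U.T @ g                                  # coordinates in the shared basis
    w, V = jnp.linalg.eigh(S)                    # eigendecomposition of the k x k statistic
    inv_sqrt_c = V @ ((V.T @ c) / jnp.sqrt(w + eps))
    residual = g - U @ c                         # component outside the basis
    return U @ inv_sqrt_c + residual / jnp.sqrt(eps)
```

In a full optimizer loop one would also accumulate each S_i from projected gradients and periodically refresh the shared basis U (e.g., via randomized SVD); those steps are omitted here.

The experiment-setup row fixes a few concrete choices: rank k = 32, randomized SVD with oversampling 0 and one iteration, batch size 1000, and a 5-epoch linear warmup followed by linear decay to 0. Below is a minimal JAX/optax sketch of those two pieces; `train_size`, `num_epochs`, and `peak_lr` are placeholders rather than values from the paper, and grafting is not shown.

```python
import jax
import jax.numpy as jnp
import optax

def randomized_svd(A, key, rank=32, oversample=0, n_iter=1):
    """Rank-`rank` randomized SVD with the stated settings:
    oversampling parameter 0 and a single power iteration."""
    m, n = A.shape
    omega = jax.random.normal(key, (n, rank + oversample))
    Y = A @ omega                                # sketch of the range of A
    for _ in range(n_iter):                      # one power iteration
        Y = A @ (A.T @ Y)
    Q, _ = jnp.linalg.qr(Y)                      # orthonormal range approximation
    U_small, s, Vt = jnp.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

# Learning-rate schedule: linear warmup for 5 epochs at batch size 1000,
# then linear decay to 0 over the remaining steps.
train_size, num_epochs, batch_size, peak_lr = 60_000, 100, 1000, 1e-3  # placeholders
steps_per_epoch = train_size // batch_size
warmup_steps = 5 * steps_per_epoch
total_steps = num_epochs * steps_per_epoch
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)
```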
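
The resulting `schedule` can be passed wherever optax expects a learning-rate callable (e.g., as the learning rate of an optimizer), and `randomized_svd` takes a `jax.random.PRNGKey` for the Gaussian test matrix.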