Extreme Tensoring for Low-Memory Preconditioning
Authors: Xinyi Chen, Naman Agarwal, Elad Hazan, Cyril Zhang, Yi Zhang
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a large-scale NLP model, we reduce the optimizer memory overhead by three orders of magnitude, without degrading performance. In this section, we provide several empirical studies on extreme tensoring. Our main experiment interpolates between the memorylessness of SGD and the full memory consumption of AdaGrad on a large-scale language model; we additionally isolate the effects of preconditioner expressivity in a synthetic experiment of the same form, and provide a parallel CIFAR-10 experiment in the appendix. |
| Researcher Affiliation | Collaboration | Xinyi Chen (Google AI Princeton, Princeton, NJ, xinyic@google.com); Naman Agarwal (Google AI Princeton, Princeton, NJ, namanagarwal@google.com); Elad Hazan (Princeton University & Google AI Princeton, Princeton, NJ, ehazan@cs.princeton.edu); Cyril Zhang (Princeton University & Google AI Princeton, Princeton, NJ, cyril.zhang@princeton.edu); Yi Zhang (Princeton University, Princeton, NJ, y.zhang@cs.princeton.edu) |
| Pseudocode | Yes | Algorithm 1: AdaGrad with extreme tensoring (a hedged sketch of this update appears after the table). |
| Open Source Code | No | The paper mentions using the open-source Tensor2Tensor package (Vaswani et al., 2018) but does not provide a statement or link for the source code of its own proposed method. |
| Open Datasets | Yes | Our main empirical study focuses on large-scale language modeling with the Transformer architecture (Vaswani et al., 2017) on the Google Billion Words (GBW) dataset (Chelba et al., 2013), and the results are shown in Figure 2. In this section, we evaluate the memory-performance trade-off of our proposed algorithm on the CIFAR-10 dataset (Krizhevsky, 2009). |
| Dataset Splits | No | The paper mentions hyperparameter tuning and learning rate schedules but does not provide specific details on training/validation/test dataset splits (exact percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We trained the model on one V100 GPU, with parallel hyperparameter tuning on the global learning rate multiplier. |
| Software Dependencies | No | The paper mentions using the Tensor2Tensor framework (Vaswani et al., 2018) and Vizier (Golovin et al., 2017) but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For our experiments, we use a learning rate schedule of η_t = c · min(10⁻⁶ t, 1/t), and c is a hyperparameter we tune for each experiment. ... We train each model for 500K steps on GPUs, with a max sequence length of 256 tokens and a max of 4096 tokens per batch. Global learning rates are selected by hyperparameter search. We use batch size 128 and weight decay 5×10⁻⁴ for all the experiments in this section. (A small sketch of this schedule follows the table.) |
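
As a reading aid for the pseudocode row, here is a minimal NumPy sketch of diagonal AdaGrad with extreme tensoring, assuming the update keeps one squared-gradient accumulator per tensor mode and divides the gradient entrywise by the outer product of those accumulators, each raised to the 1/(2k) power. The class name, the exact exponent, and the ε placement are assumptions for illustration, not taken verbatim from Algorithm 1 of the paper.

```python
import numpy as np

class ExtremeTensoredAdaGrad:
    """Diagonal AdaGrad whose accumulator is approximated by a tensor
    product of one small vector per mode: optimizer memory per parameter
    tensor is sum(d_i) floats instead of prod(d_i)."""

    def __init__(self, param_shape, lr=0.1, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.k = len(param_shape)
        self.acc = [np.zeros(d) for d in param_shape]  # one vector per mode

    def step(self, param, grad):
        g2 = grad ** 2
        for i in range(self.k):
            # Marginalize the squared gradient over every mode except i.
            other_axes = tuple(a for a in range(self.k) if a != i)
            self.acc[i] += g2.sum(axis=other_axes)
        # Entrywise denominator: outer product of per-mode accumulators,
        # each raised to 1/(2k) so the combined exponent plays the role
        # of the usual square root in AdaGrad (assumed convention).
        denom = np.ones(param.shape)
        for i in range(self.k):
            bshape = [1] * self.k
            bshape[i] = param.shape[i]
            factor = (self.acc[i] + self.eps) ** (1.0 / (2 * self.k))
            denom = denom * factor.reshape(bshape)
        return param - self.lr * grad / denom


# Toy usage: one step on a random order-2 parameter tensor.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))
opt = ExtremeTensoredAdaGrad(w.shape, lr=0.1)
g = rng.normal(size=w.shape)  # stand-in for a stochastic gradient
w = opt.step(w, g)
```

The per-mode accumulators are what the paper's memory claim refers to: a d₁ × … × d_k parameter tensor needs only Σᵢ dᵢ extra floats of optimizer state here, rather than the ∏ᵢ dᵢ floats of full diagonal AdaGrad.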
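
The learning-rate schedule quoted in the experiment-setup row can also be written as a one-line helper. This is a sketch of the formula exactly as quoted; the multiplier c is the globally tuned constant mentioned in the row, and the crossover step of 1000 is simply an arithmetic consequence of those constants, not a number reported by the authors.

```python
def lr_schedule(t, c=1.0):
    # eta_t = c * min(1e-6 * t, 1 / t): linear warmup, then 1/t decay.
    return c * min(1e-6 * t, 1.0 / t)

# With these constants the two branches cross at t = 1000 steps.
print(lr_schedule(100), lr_schedule(1000), lr_schedule(10_000))
```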