Extreme Tensoring for Low-Memory Preconditioning
Authors: Xinyi Chen, Naman Agarwal, Elad Hazan, Cyril Zhang, Yi Zhang
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a large-scale NLP model, we reduce the optimizer memory overhead by three orders of magnitude, without degrading performance. In this section, we provide several empirical studies on extreme tensoring. Our main experiment interpolates between the memorylessness of SGD and the full memory consumption of AdaGrad on a large-scale language model; we additionally isolate the effects of preconditioner expressivity in a synthetic experiment of the same form, and provide a parallel CIFAR-10 experiment in the appendix. |
| Researcher Affiliation | Collaboration | Xinyi Chen (Google AI Princeton, Princeton, NJ, xinyic@google.com); Naman Agarwal (Google AI Princeton, Princeton, NJ, namanagarwal@google.com); Elad Hazan (Princeton University & Google AI Princeton, Princeton, NJ, ehazan@cs.princeton.edu); Cyril Zhang (Princeton University & Google AI Princeton, Princeton, NJ, cyril.zhang@princeton.edu); Yi Zhang (Princeton University, Princeton, NJ, y.zhang@cs.princeton.edu) |
| Pseudocode | Yes | Algorithm 1: AdaGrad with extreme tensoring (a hedged sketch of this update appears after the table). |
| Open Source Code | No | The paper mentions using the open-source Tensor2Tensor package (Vaswani et al., 2018) but does not provide a statement or link for the source code of its own proposed method. |
| Open Datasets | Yes | Our main empirical study focuses on large-scale language modeling with the Transformer architecture (Vaswani et al., 2017) on the Google Billion Words (GBW) dataset (Chelba et al., 2013), and the results are shown in Figure 2. In this section, we evaluate the memory-performance trade-off of our proposed algorithm on the CIFAR-10 dataset (Krizhevsky, 2009). |
| Dataset Splits | No | The paper mentions hyperparameter tuning and learning rate schedules but does not provide specific details on training/validation/test dataset splits (exact percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We trained the model on one V100 GPU, with parallel hyperparameter tuning on the global learning rate multiplier. |
| Software Dependencies | No | The paper mentions using the Tensor2Tensor framework (Vaswani et al., 2018) and Vizier (Golovin et al., 2017) but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For our experiments, we use a learning rate schedule of η_t = c · min(10⁻⁶ t, 1/t), and c is a hyperparameter we tune for each experiment. ... We train each model for 500K steps on GPUs, with a max sequence length of 256 tokens and a max of 4096 tokens per batch. Global learning rates are selected by hyperparameter search. We use batch size 128 and weight decay 5×10⁻⁴ for all the experiments in this section. (A small sketch of this schedule follows the table.) |
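
As a reading aid for the pseudocode row, here is a minimal NumPy sketch of diagonal AdaGrad with extreme tensoring, assuming the update keeps one squared-gradient accumulator per tensor mode and divides the gradient entrywise by the outer product of those accumulators, each raised to the 1/(2k) power. The class name, the exact exponent, and the ε placement are assumptions for illustration, not taken verbatim from Algorithm 1 of the paper.

```python
import numpy as np

class ExtremeTensoredAdaGrad:
    """Diagonal AdaGrad whose accumulator is approximated by a tensor
    product of one small vector per mode: optimizer memory per parameter
    tensor is sum(d_i) floats instead of prod(d_i)."""

    def __init__(self, param_shape, lr=0.1, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.k = len(param_shape)
        self.acc = [np.zeros(d) for d in param_shape]  # one vector per mode

    def step(self, param, grad):
        g2 = grad ** 2
        for i in range(self.k):
            # Marginalize the squared gradient over every mode except i.
            other_axes = tuple(a for a in range(self.k) if a != i)
            self.acc[i] += g2.sum(axis=other_axes)
        # Entrywise denominator: outer product of per-mode accumulators,
        # each raised to 1/(2k) so the combined exponent plays the role
        # of the usual square root in AdaGrad (assumed convention).
        denom = np.ones(param.shape)
        for i in range(self.k):
            bshape = [1] * self.k
            bshape[i] = param.shape[i]
            factor = (self.acc[i] + self.eps) ** (1.0 / (2 * self.k))
            denom = denom * factor.reshape(bshape)
        return param - self.lr * grad / denom


# Toy usage: one step on a random order-2 parameter tensor.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))
opt = ExtremeTensoredAdaGrad(w.shape, lr=0.1)
g = rng.normal(size=w.shape)  # stand-in for a stochastic gradient
w = opt.step(w, g)
```

The per-mode accumulators are what the paper's memory claim refers to: a d₁ × … × d_k parameter tensor needs only Σᵢ dᵢ extra floats of optimizer state here, rather than the ∏ᵢ dᵢ floats of full diagonal AdaGrad.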
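
The learning-rate schedule quoted in the experiment-setup row can also be written as a one-line helper. This is a sketch of the formula exactly as quoted; the multiplier c is the globally tuned constant mentioned in the row, and the crossover step of 1000 is simply an arithmetic consequence of those constants, not a number reported by the authors.

```python
def lr_schedule(t, c=1.0):
    # eta_t = c * min(1e-6 * t, 1 / t): linear warmup, then 1/t decay.
    return c * min(1e-6 * t, 1.0 / t)

# With these constants the two branches cross at t = 1000 steps.
print(lr_schedule(100), lr_schedule(1000), lr_schedule(10_000))
```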