Sub-Linear Memory: How to Make Performers SLiM

Authors: Valerii Likhosherstov, Krzysztof M. Choromanski, Jared Quincy Davis, Xingyou Song, Adrian Weller

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed tradeoff empirically, and confirm backward-compatibility for the synthetic Copying Task and language modelling on Penn Treebank [25] and Enwik8 [24] datasets. We analyse 4 model configurations (L, s, d_model): I = (8192, 1, 1024), II = (1024, 3, 512), III = (4096, 3, 1024), IV = (16384, 3, 1024). In all configurations, we set d_ff = 4 d_model and k = d_model / 64 (number of heads). We set M = d and employ the elementwise-quadratic feature mapping g(x) = (x_i^2)_{i=1}^d in (1), which we find to work well. In all experiments Σ = {0, ..., 255} and the batch size is set to 1, i.e. we analyse a setup where gradient accumulation cannot be used to decrease memory, and therefore our algorithm is crucial. (A sketch of this feature mapping is given below the table.)
Researcher Affiliation | Collaboration | Valerii Likhosherstov, University of Cambridge (vl304@cam.ac.uk); Krzysztof Choromanski, Google Brain & Columbia University; Jared Davis, DeepMind & Stanford University; Xingyou Song, Google Brain; Adrian Weller, University of Cambridge & Alan Turing Institute
Pseudocode | Yes | Algorithm 1: Low-memory forward-backward pass. See Algorithm 2 for the update procedure. Compared to notation from the text, redundant indices are dropped and tensor names are reused here and in Algorithm 2. ... Algorithm 2: update procedure. (A simplified chunked-checkpointing sketch of the underlying idea is given below the table.)
Open Source Code | Yes | Code: https://github.com/google-research/google-research/tree/master/performer/models/slim_performer
Open Datasets | Yes | We evaluate the proposed tradeoff empirically, and confirm backward-compatibility for the synthetic Copying Task and language modelling on Penn Treebank [25] and Enwik8 [24] datasets.
Dataset Splits | Yes | For each setup, we compare training with full gradient computation, a fine-tuning regime in which the first half of iterations is run using the full algorithm and the second half is run using Algorithm 1 with various values of C, and, in addition, training from scratch equipped with memory-efficient gradient computation via Algorithm 1. Figure 4 demonstrates the results: all methods yield almost the same performance. This confirms that memory-efficient gradient computation is backward-compatible during training. ... To analyse the scenario where the model is pretrained on a server and then fine-tuned (F/T) with a small C on a low-memory device, we add the following experiment. We take a pretrained model from either the PTB or ENW setup from Section 4.3 and randomly subsample 5000 examples from the corresponding validation set.
Hardware Specification | Yes | To ensure that reproduction of experiments is accessible for a wider audience, we use a single NVIDIA Tesla P100 GPU with 16 GB memory for each experiment.
Software Dependencies | Yes | Our code is in PyTorch 1.7 [30].
Experiment Setup | Yes | We analyse 4 model configurations (L, s, d_model): I = (8192, 1, 1024), II = (1024, 3, 512), III = (4096, 3, 1024), IV = (16384, 3, 1024). In all configurations, we set d_ff = 4 d_model and k = d_model / 64 (number of heads). We set M = d and employ the elementwise-quadratic feature mapping g(x) = (x_i^2)_{i=1}^d in (1), which we find to work well. In all experiments Σ = {0, ..., 255} and the batch size is set to 1, i.e. we analyse a setup where gradient accumulation cannot be used to decrease memory, and therefore our algorithm is crucial. ... We perform one-step gradient descent with a 0.01 learning rate (tuned on another random subset) to minimize the loss computed on the first half of each sequence and evaluate Bits Per Character (BPC) on the second half. (A sketch of this one-step fine-tuning evaluation is given below the table.)
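For concreteness, the elementwise-quadratic feature mapping quoted above can be illustrated with a minimal single-head sketch of causal (prefix-sum) linear attention. This is an illustrative sketch, not the authors' implementation: the function and tensor names, the single-head setting, and the small stabilising constant `eps` are assumptions made for the example.

```python
import torch

def elementwise_quadratic(x):
    # g(x) = (x_i^2)_{i=1}^d applied to each row; here M = d features.
    return x ** 2

def causal_linear_attention(q, k, v, eps=1e-6):
    # Single-head prefix-sum (causal) linear attention.
    # q, k: (L, d) raw queries/keys; v: (L, d) values.
    qp, kp = elementwise_quadratic(q), elementwise_quadratic(k)   # (L, M)
    num_state = torch.zeros(qp.shape[1], v.shape[1])  # running sum of k'_t v_t^T, shape (M, d)
    den_state = torch.zeros(qp.shape[1])              # running sum of k'_t, shape (M,)
    outputs = []
    for t in range(qp.shape[0]):
        num_state = num_state + torch.outer(kp[t], v[t])
        den_state = den_state + kp[t]
        outputs.append((qp[t] @ num_state) / (qp[t] @ den_state + eps))
    return torch.stack(outputs)                       # (L, d)

# Usage: out = causal_linear_attention(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8))
```

The key point the example makes is that the attention state carried along the sequence is only the pair of prefix sums of size (M, d) and (M,), independent of the sequence length L.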
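The pseudocode row refers to the paper's Algorithm 1 (low-memory forward-backward pass). As a rough illustration of the underlying idea, the prefix-sum state above can be carried across C slices of the sequence while per-slice activations are recomputed during the backward pass. The sketch below uses generic PyTorch gradient checkpointing per slice; it is a simplification under assumed names (`slice_step`, `chunked_causal_linear_attention`) and not the authors' exact two-pass algorithm.

```python
import torch
from torch.utils.checkpoint import checkpoint

def slice_step(qp_c, kp_c, v_c, num_state, den_state):
    # Process one slice of positions, carrying the O(M*d) prefix-sum state.
    eps = 1e-6
    outputs = []
    for t in range(qp_c.shape[0]):
        num_state = num_state + torch.outer(kp_c[t], v_c[t])
        den_state = den_state + kp_c[t]
        outputs.append((qp_c[t] @ num_state) / (qp_c[t] @ den_state + eps))
    return torch.stack(outputs), num_state, den_state

def chunked_causal_linear_attention(qp, kp, v, num_slices):
    # Split the (already feature-mapped) sequence into `num_slices` slices.
    # Checkpointing stores only the slice inputs and the small boundary states,
    # recomputing per-slice activations in backward: more slices, less memory,
    # at the price of extra compute.
    num_state = torch.zeros(qp.shape[1], v.shape[1])
    den_state = torch.zeros(qp.shape[1])
    outputs = []
    for qc, kc, vc in zip(qp.chunk(num_slices), kp.chunk(num_slices), v.chunk(num_slices)):
        out_c, num_state, den_state = checkpoint(slice_step, qc, kc, vc, num_state, den_state)
        outputs.append(out_c)
    return torch.cat(outputs)
```

This mirrors the compute-for-memory tradeoff the paper evaluates with different values of C, but through the standard checkpointing utility rather than the dedicated update procedure of Algorithm 2.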
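Finally, the one-step fine-tuning evaluation quoted in the Experiment Setup row can be sketched as follows. The model interface (a callable mapping a (1, T) tensor of byte tokens to (1, T, 256) next-token logits), the token-level halving, and the next-token shift are assumptions for illustration; the paper's actual evaluation may differ, e.g. in how the second half is conditioned on the first half's context.

```python
import math
import torch
import torch.nn.functional as F

def one_step_finetune_bpc(model, seq, lr=0.01, vocab_size=256):
    # One SGD step (learning rate 0.01) on the first half of `seq`,
    # then Bits Per Character on the second half.
    # `seq`: (1, T) LongTensor of byte tokens in {0, ..., 255}.  [assumed interface]
    T = seq.shape[1]
    first, second = seq[:, : T // 2], seq[:, T // 2 :]

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    logits = model(first[:, :-1])                                  # (1, T//2 - 1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), first[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        logits = model(second[:, :-1])
        nll = F.cross_entropy(logits.reshape(-1, vocab_size), second[:, 1:].reshape(-1))
    return nll.item() / math.log(2)                                # nats -> bits per character
```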