Sub-Linear Memory: How to Make Performers SLiM
Authors: Valerii Likhosherstov, Krzysztof M. Choromanski, Jared Quincy Davis, Xingyou Song, Adrian Weller
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed tradeoff empirically, and confirm backward-compatibility for the synthetic Copying Task and language modelling on the Penn Treebank [25] and Enwik8 [24] datasets. We analyse 4 model configurations (L, s, d_model): I = (8192, 1, 1024), II = (1024, 3, 512), III = (4096, 3, 1024), IV = (16384, 3, 1024). In all configurations, we set d_ff = 4·d_model and k = d_model/64 (number of heads). We set M = d and employ the elementwise-quadratic feature mapping g(x) = (x_i^2)_{i=1}^d in (1), which we find to work well. In all experiments Σ = {0, ..., 255} and the batch size is set to 1, i.e. we analyse a setup where gradient accumulation cannot be used to decrease memory, and therefore our algorithm is crucial. (See the configuration sketch after the table.) |
| Researcher Affiliation | Collaboration | Valerii Likhosherstov (University of Cambridge, vl304@cam.ac.uk); Krzysztof Choromanski (Google Brain & Columbia University); Jared Davis (DeepMind & Stanford University); Xingyou Song (Google Brain); Adrian Weller (University of Cambridge & Alan Turing Institute) |
| Pseudocode | Yes | Algorithm 1: Low-memory forward-backward pass. See Algorithm 2 for the update procedure. Compared to notation from the text, redundant indices are dropped and tensor names are reused here and in Algorithm 2. ... Algorithm 2: the update procedure. (See the chunked-recomputation sketch after the table.) |
| Open Source Code | Yes | Code: https://github.com/google-research/google-research/tree/master/performer/models/slim_performer |
| Open Datasets | Yes | We evaluate the proposed tradeoff empirically, and confirm backward-compatibility for the synthetic Copying Task and language modelling on Penn Treebank [25] and Enwik8 [24] datasets. |
| Dataset Splits | Yes | For each setup, we compare training with full gradient computation; a fine-tuning regime, in which the first half of the iterations is run using the full algorithm and the second half using Algorithm 1 with various values of C; and training from scratch equipped with memory-efficient gradient computation via Algorithm 1. Figure 4 demonstrates the results: all methods yield almost the same performance. This confirms that memory-efficient gradient computation is backward-compatible during training. ... To analyse the scenario where a model is pretrained on a server and then fine-tuned (F/T) with a small C on a low-memory device, we add the following experiment. We take a pretrained model from either the PTB or ENW setup from Section 4.3 and randomly subsample 5000 examples from the corresponding validation set. |
| Hardware Specification | Yes | To ensure that reproduction of experiments is accessible for a wider audience, we use a single NVIDIA Tesla P100 GPU with 16 GB memory for each experiment. |
| Software Dependencies | Yes | Our code is in PyTorch 1.7 [30]. |
| Experiment Setup | Yes | We analyse 4 model configurations (L, s, d_model): I = (8192, 1, 1024), II = (1024, 3, 512), III = (4096, 3, 1024), IV = (16384, 3, 1024). In all configurations, we set d_ff = 4·d_model and k = d_model/64 (number of heads). We set M = d and employ the elementwise-quadratic feature mapping g(x) = (x_i^2)_{i=1}^d in (1), which we find to work well. In all experiments Σ = {0, ..., 255} and the batch size is set to 1, i.e. we analyse a setup where gradient accumulation cannot be used to decrease memory, and therefore our algorithm is crucial. ... We perform one-step gradient descent with a 0.01 learning rate (tuned on another random subset) to minimize the loss computed on the first half of each sequence and evaluate Bits Per Character (BPC) on the second half. (See the fine-tuning sketch after the table.) |
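
A minimal sketch of the reported hyperparameter setup, in PyTorch since the paper states its code is in PyTorch 1.7. The names `quadratic_feature_map` and `CONFIGS` are illustrative and not taken from the released code; only the values (the four (L, s, d_model) configurations, d_ff = 4·d_model, k = d_model/64 heads, and the feature map g(x) = (x_i^2)_{i=1}^d with M = d) come from the quoted text.

```python
import torch

def quadratic_feature_map(x: torch.Tensor) -> torch.Tensor:
    """Elementwise-quadratic feature map g(x) = (x_i^2)_{i=1}^d; with M = d,
    queries/keys of shape (..., d) map to non-negative features of the same shape."""
    return x ** 2

# The four (L, s, d_model) configurations reported in the paper, with the derived
# hyperparameters d_ff = 4 * d_model and k = d_model / 64 attention heads.
CONFIGS = {
    "I":   dict(L=8192,  s=1, d_model=1024),
    "II":  dict(L=1024,  s=3, d_model=512),
    "III": dict(L=4096,  s=3, d_model=1024),
    "IV":  dict(L=16384, s=3, d_model=1024),
}
for cfg in CONFIGS.values():
    cfg["d_ff"] = 4 * cfg["d_model"]
    cfg["num_heads"] = cfg["d_model"] // 64
```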
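
Algorithm 1 itself is not reproduced in this report; the sketch below only illustrates the underlying low-memory idea under the assumption that it amounts to processing the sequence in C chunks, carrying a small prefix-sum state across chunk boundaries, and recomputing per-chunk activations during the backward pass (here via PyTorch gradient checkpointing). `chunk_forward` is a hypothetical placeholder, not the authors' layer.

```python
import torch
from torch.utils.checkpoint import checkpoint

def chunk_forward(x_chunk: torch.Tensor, state: torch.Tensor):
    # Hypothetical placeholder for one causal Performer chunk: consume the incoming
    # prefix-sum state, emit the chunk's outputs, and return the updated state.
    new_state = state + x_chunk.sum(dim=0, keepdim=True)
    return x_chunk + new_state, new_state

def chunked_forward(x: torch.Tensor, num_chunks: int) -> torch.Tensor:
    # Only the small carried state is kept between chunks; within-chunk activations
    # are dropped by checkpointing and recomputed on the backward pass.
    outputs, state = [], torch.zeros(1, x.shape[-1])
    for x_chunk in x.chunk(num_chunks, dim=0):
        out, state = checkpoint(chunk_forward, x_chunk, state)
        outputs.append(out)
    return torch.cat(outputs, dim=0)
```

With `x.requires_grad_()` set before the call, peak activation memory scales with the chunk length rather than the full sequence length, which mirrors the role that the parameter C plays in the quoted experiments.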
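
The one-step fine-tuning evaluation could look roughly as follows. This is a sketch assuming a standard next-token language-model interface `model(tokens) -> logits`; `model` and `seq` are hypothetical, and using the full preceding context for the second-half evaluation is an assumption rather than a detail stated in the quote.

```python
import math
import torch
import torch.nn.functional as F

def one_step_finetune_bpc(model, seq: torch.Tensor, lr: float = 0.01) -> float:
    """One SGD step on the loss over the first half of `seq`, then BPC on the second half."""
    half = seq.shape[0] // 2
    opt = torch.optim.SGD(model.parameters(), lr=lr)

    # One gradient step on the first half of the sequence.
    logits = model(seq[: half - 1].unsqueeze(0))
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), seq[1:half])
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Bits per character on the second half (full preceding context assumed).
    with torch.no_grad():
        logits = model(seq[:-1].unsqueeze(0))
        nll = F.cross_entropy(logits[0, half - 1:], seq[half:])
    return nll.item() / math.log(2)  # convert nats to bits
```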