Sub-Linear Memory: How to Make Performers SLiM

Authors: Valerii Likhosherstov, Krzysztof M. Choromanski, Jared Quincy Davis, Xingyou Song, Adrian Weller

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed tradeoff empirically, and confirm backward-compatibility for the synthetic Copying Task and language modelling on Penn Treebank [25] and Enwik8 [24] datasets. We analyse 4 model configurations (L, s, d_model): I = (8192, 1, 1024), II = (1024, 3, 512), III = (4096, 3, 1024), IV = (16384, 3, 1024). In all configurations, we set d_ff = 4 d_model and k = d_model / 64 (number of heads). We set M = d and employ the elementwise-quadratic feature mapping g(x) = (x_i^2)_{i=1}^d in (1), which we find to work well. In all experiments Σ = {0, ..., 255} and the batch size is set to 1, i.e. we analyse a setup where gradient accumulation cannot be used to decrease memory, and therefore our algorithm is crucial. (A sketch of this feature mapping is given below the table.)
Researcher Affiliation | Collaboration | Valerii Likhosherstov, University of Cambridge (vl304@cam.ac.uk); Krzysztof Choromanski, Google Brain & Columbia University; Jared Davis, DeepMind & Stanford University; Xingyou Song, Google Brain; Adrian Weller, University of Cambridge & Alan Turing Institute
Pseudocode | Yes | Algorithm 1: Low-memory forward-backward pass. See Algorithm 2 for the update procedure. Compared to notation from the text, redundant indices are dropped and tensor names are reused here and in Algorithm 2. ... Algorithm 2: update procedure. (A simplified chunked-checkpointing sketch of the underlying idea is given below the table.)
Open Source Code | Yes | Code: https://github.com/google-research/google-research/tree/master/performer/models/slim_performer
Open Datasets | Yes | We evaluate the proposed tradeoff empirically, and confirm backward-compatibility for the synthetic Copying Task and language modelling on Penn Treebank [25] and Enwik8 [24] datasets.
Dataset Splits | Yes | For each setup, we compare training with full gradient computation, a fine-tuning regime in which the first half of iterations is run using the full algorithm and the second half is run using Algorithm 1 with various values of C, and, in addition, training from scratch equipped with memory-efficient gradient computation via Algorithm 1. Figure 4 demonstrates the results: all methods yield almost the same performance. This confirms that memory-efficient gradient computation is backward-compatible during training. ... To analyse the scenario where the model is pretrained on a server and then fine-tuned (F/T) with a small C on a low-memory device, we add the following experiment. We take a pretrained model from either the PTB or ENW setup from Section 4.3 and randomly subsample 5000 examples from the corresponding validation set.
Hardware Specification | Yes | To ensure that reproduction of experiments is accessible for a wider audience, we use a single NVIDIA Tesla P100 GPU with 16 GB memory for each experiment.
Software Dependencies | Yes | Our code is in PyTorch 1.7 [30].
Experiment Setup | Yes | We analyse 4 model configurations (L, s, d_model): I = (8192, 1, 1024), II = (1024, 3, 512), III = (4096, 3, 1024), IV = (16384, 3, 1024). In all configurations, we set d_ff = 4 d_model and k = d_model / 64 (number of heads). We set M = d and employ the elementwise-quadratic feature mapping g(x) = (x_i^2)_{i=1}^d in (1), which we find to work well. In all experiments Σ = {0, ..., 255} and the batch size is set to 1, i.e. we analyse a setup where gradient accumulation cannot be used to decrease memory, and therefore our algorithm is crucial. ... We perform one-step gradient descent with a 0.01 learning rate (tuned on another random subset) to minimize the loss computed on the first half of each sequence and evaluate Bits Per Character (BPC) on the second half. (A sketch of this one-step fine-tuning evaluation is given below the table.)
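For concreteness, the elementwise-quadratic feature mapping quoted above can be illustrated with a minimal single-head sketch of causal (prefix-sum) linear attention. This is an illustrative sketch, not the authors' implementation: the function and tensor names, the single-head setting, and the small stabilising constant `eps` are assumptions made for the example.

```python
import torch

def elementwise_quadratic(x):
    # g(x) = (x_i^2)_{i=1}^d applied to each row; here M = d features.
    return x ** 2

def causal_linear_attention(q, k, v, eps=1e-6):
    # Single-head prefix-sum (causal) linear attention.
    # q, k: (L, d) raw queries/keys; v: (L, d) values.
    qp, kp = elementwise_quadratic(q), elementwise_quadratic(k)   # (L, M)
    num_state = torch.zeros(qp.shape[1], v.shape[1])  # running sum of k'_t v_t^T, shape (M, d)
    den_state = torch.zeros(qp.shape[1])              # running sum of k'_t, shape (M,)
    outputs = []
    for t in range(qp.shape[0]):
        num_state = num_state + torch.outer(kp[t], v[t])
        den_state = den_state + kp[t]
        outputs.append((qp[t] @ num_state) / (qp[t] @ den_state + eps))
    return torch.stack(outputs)                       # (L, d)

# Usage: out = causal_linear_attention(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8))
```

The key point the example makes is that the attention state carried along the sequence is only the pair of prefix sums of size (M, d) and (M,), independent of the sequence length L.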
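The pseudocode row refers to the paper's Algorithm 1 (low-memory forward-backward pass). As a rough illustration of the underlying idea, the prefix-sum state above can be carried across C slices of the sequence while per-slice activations are recomputed during the backward pass. The sketch below uses generic PyTorch gradient checkpointing per slice; it is a simplification under assumed names (`slice_step`, `chunked_causal_linear_attention`) and not the authors' exact two-pass algorithm.

```python
import torch
from torch.utils.checkpoint import checkpoint

def slice_step(qp_c, kp_c, v_c, num_state, den_state):
    # Process one slice of positions, carrying the O(M*d) prefix-sum state.
    eps = 1e-6
    outputs = []
    for t in range(qp_c.shape[0]):
        num_state = num_state + torch.outer(kp_c[t], v_c[t])
        den_state = den_state + kp_c[t]
        outputs.append((qp_c[t] @ num_state) / (qp_c[t] @ den_state + eps))
    return torch.stack(outputs), num_state, den_state

def chunked_causal_linear_attention(qp, kp, v, num_slices):
    # Split the (already feature-mapped) sequence into `num_slices` slices.
    # Checkpointing stores only the slice inputs and the small boundary states,
    # recomputing per-slice activations in backward: more slices, less memory,
    # at the price of extra compute.
    num_state = torch.zeros(qp.shape[1], v.shape[1])
    den_state = torch.zeros(qp.shape[1])
    outputs = []
    for qc, kc, vc in zip(qp.chunk(num_slices), kp.chunk(num_slices), v.chunk(num_slices)):
        out_c, num_state, den_state = checkpoint(slice_step, qc, kc, vc, num_state, den_state)
        outputs.append(out_c)
    return torch.cat(outputs)
```

This mirrors the compute-for-memory tradeoff the paper evaluates with different values of C, but through the standard checkpointing utility rather than the dedicated update procedure of Algorithm 2.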
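Finally, the one-step fine-tuning evaluation quoted in the Experiment Setup row can be sketched as follows. The model interface (a callable mapping a (1, T) tensor of byte tokens to (1, T, 256) next-token logits), the token-level halving, and the next-token shift are assumptions for illustration; the paper's actual evaluation may differ, e.g. in how the second half is conditioned on the first half's context.

```python
import math
import torch
import torch.nn.functional as F

def one_step_finetune_bpc(model, seq, lr=0.01, vocab_size=256):
    # One SGD step (learning rate 0.01) on the first half of `seq`,
    # then Bits Per Character on the second half.
    # `seq`: (1, T) LongTensor of byte tokens in {0, ..., 255}.  [assumed interface]
    T = seq.shape[1]
    first, second = seq[:, : T // 2], seq[:, T // 2 :]

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    logits = model(first[:, :-1])                                  # (1, T//2 - 1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), first[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        logits = model(second[:, :-1])
        nll = F.cross_entropy(logits.reshape(-1, vocab_size), second[:, 1:].reshape(-1))
    return nll.item() / math.log(2)                                # nats -> bits per character
```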