Linear Transformers Are Secretly Fast Weight Programmers

Authors: Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks which demonstrate the benefits of our methods.
Researcher Affiliation | Academia | The Swiss AI Lab IDSIA, USI & SUPSI. Correspondence to: Imanol Schlag <imanol@idsia.ch>, Kazuki Irie <kazuki@idsia.ch>, Jürgen Schmidhuber <juergen@idsia.ch>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code used in this paper is available at github.com/ischlag/fast-weight-transformers.
Open Datasets | Yes | We use the standard WMT14 English to German machine translation task and standard data setups (Ott et al., 2018; Vaswani et al., 2017). We use the standard WikiText-103 (Merity et al., 2017) dataset.
Dataset Splits | Yes | In Figure 2, the best validation set performance for each model and each S is displayed (for the learning curves see Appendix D.1). The validation and test sets also contain similarly long dependencies, respectively with 218K and 246K running words for 60 articles each.
Hardware Specification | Yes | train Vaswani et al. (2017)'s big models for about 4 days on three V100 GPUs.
Software Dependencies | No | The paper mentions "PyTorch" and "CUDA kernels" but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | All models are trained with a mini-batch size of 32 until the evaluation loss falls below 0.001 or until lack of progress for 1000 steps. For evaluation, we sample 20 sequences and test all possible queries... Each model is trained in mini-batches using this loss and Adam with default hyperparameters unless stated otherwise. In the small configuration, we set the model dimension (same for key, value, and query) D to 128, and the training and evaluation context length L to 256. We note that D = H · d_dot where H is the number of heads. H is set to 8. The feed-forward layer dimension is 2048. The number of layers is 16 in all configurations. In the medium configuration, we set D = 256 and L = 384.
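The experiment-setup quote above amounts to a small configuration file. The following is a minimal sketch of those reported hyperparameters, assuming a standard PyTorch stack; it is not the authors' implementation (that lives at github.com/ischlag/fast-weight-transformers), and it uses stock softmax attention layers as a stand-in rather than the paper's linear/fast-weight attention variants. Names such as LMConfig and build_model, and the vocabulary size, are hypothetical.

from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class LMConfig:
    d_model: int = 128        # D: model/key/value/query dimension (small config)
    n_heads: int = 8          # H; per-head dimension d_dot = D / H = 128 / 8 = 16
    d_ff: int = 2048          # feed-forward layer dimension
    n_layers: int = 16        # 16 layers in all configurations
    context_len: int = 256    # L: training and evaluation context length
    batch_size: int = 32      # mini-batch size reported in the paper
    patience_steps: int = 1000        # stop after no progress for 1000 steps
    target_eval_loss: float = 1e-3    # or once evaluation loss falls below 0.001


SMALL = LMConfig()                               # D = 128, L = 256
MEDIUM = LMConfig(d_model=256, context_len=384)  # D = 256, L = 384


def build_model(cfg: LMConfig, vocab_size: int) -> nn.Module:
    # Stand-in backbone from standard PyTorch layers; the paper replaces
    # softmax attention with linear / fast-weight attention kernels.
    layer = nn.TransformerEncoderLayer(
        d_model=cfg.d_model,
        nhead=cfg.n_heads,
        dim_feedforward=cfg.d_ff,
        batch_first=True,
    )
    return nn.Sequential(
        nn.Embedding(vocab_size, cfg.d_model),
        nn.TransformerEncoder(layer, num_layers=cfg.n_layers),
        nn.Linear(cfg.d_model, vocab_size),
    )


model = build_model(SMALL, vocab_size=32000)       # vocabulary size is illustrative only
optimizer = torch.optim.Adam(model.parameters())   # Adam with default hyperparameters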