Linear Transformers Are Secretly Fast Weight Programmers

Authors: Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks which demonstrate the benefits of our methods.
Researcher Affiliation | Academia | The Swiss AI Lab IDSIA, USI & SUPSI. Correspondence to: Imanol Schlag <imanol@idsia.ch>, Kazuki Irie <kazuki@idsia.ch>, Jürgen Schmidhuber <juergen@idsia.ch>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code used in this paper is available at github.com/ischlag/fast-weight-transformers.
Open Datasets | Yes | We use the standard WMT14 English to German machine translation task and standard data setups (Ott et al., 2018; Vaswani et al., 2017). We use the standard WikiText-103 (Merity et al., 2017) dataset.
Dataset Splits | Yes | In Figure 2, the best validation set performance for each model and each S is displayed (for the learning curves see Appendix D.1). The validation and test sets also contain similarly long dependencies, respectively with 218K and 246K running words for 60 articles each.
Hardware Specification | Yes | train Vaswani et al. (2017)'s big models for about 4 days on three V100 GPUs.
Software Dependencies | No | The paper mentions "PyTorch" and "CUDA kernels" but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | All models are trained with a mini-batch size of 32 until the evaluation loss falls below 0.001 or until lack of progress for 1000 steps. For evaluation, we sample 20 sequences and test all possible queries... Each model is trained in mini-batches using this loss and Adam with default hyperparameters unless stated otherwise. In the small configuration, we set the model dimension (same for key, value, and query) D to 128, and the training and evaluation context length L to 256. We note that D = H · d_dot where H is the number of heads. H is set to 8. The feed-forward layer dimension is 2048. The number of layers is 16 in all configurations. In the medium configuration, we set D = 256 and L = 384.
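The experiment-setup quote above amounts to a small configuration file. The following is a minimal sketch of those reported hyperparameters, assuming a standard PyTorch stack; it is not the authors' implementation (that lives at github.com/ischlag/fast-weight-transformers), and it uses stock softmax attention layers as a stand-in rather than the paper's linear/fast-weight attention variants. Names such as LMConfig and build_model, and the vocabulary size, are hypothetical.

from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class LMConfig:
    d_model: int = 128        # D: model/key/value/query dimension (small config)
    n_heads: int = 8          # H; per-head dimension d_dot = D / H = 128 / 8 = 16
    d_ff: int = 2048          # feed-forward layer dimension
    n_layers: int = 16        # 16 layers in all configurations
    context_len: int = 256    # L: training and evaluation context length
    batch_size: int = 32      # mini-batch size reported in the paper
    patience_steps: int = 1000        # stop after no progress for 1000 steps
    target_eval_loss: float = 1e-3    # or once evaluation loss falls below 0.001


SMALL = LMConfig()                               # D = 128, L = 256
MEDIUM = LMConfig(d_model=256, context_len=384)  # D = 256, L = 384


def build_model(cfg: LMConfig, vocab_size: int) -> nn.Module:
    # Stand-in backbone from standard PyTorch layers; the paper replaces
    # softmax attention with linear / fast-weight attention kernels.
    layer = nn.TransformerEncoderLayer(
        d_model=cfg.d_model,
        nhead=cfg.n_heads,
        dim_feedforward=cfg.d_ff,
        batch_first=True,
    )
    return nn.Sequential(
        nn.Embedding(vocab_size, cfg.d_model),
        nn.TransformerEncoder(layer, num_layers=cfg.n_layers),
        nn.Linear(cfg.d_model, vocab_size),
    )


model = build_model(SMALL, vocab_size=32000)       # vocabulary size is illustrative only
optimizer = torch.optim.Adam(model.parameters())   # Adam with default hyperparameters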