Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our novel recurrent FWPs (RFWPs) on two synthetic algorithmic tasks (code execution and sequential ListOps), Wikitext-103 language models, and on the Atari 2600 2D game environment. Our models exhibit properties of Transformers and RNNs. In the reinforcement learning setting, we report large improvements over LSTM in several Atari games. |
| Researcher Affiliation | Academia | (1) The Swiss AI Lab, IDSIA, University of Lugano (USI) & SUPSI, Lugano, Switzerland; (2) King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia |
| Pseudocode | No | The paper describes mathematical formulations and architectural modifications but does not include a dedicated pseudocode or algorithm block (a hedged sketch of the underlying update rule is given after this table). |
| Open Source Code | Yes | Our code is public: https://github.com/IDSIA/recurrent-fwp |
| Open Datasets | Yes | We use the Wikitext-103 dataset [28] and follow the small model setting similar to what's used in recent works by Peng et al. [21] and Schlag et al. [23]. |
| Dataset Splits | Yes | All models are trained and evaluated on the span of 256 tokens except for the models in the last two rows (+ full context) which are trained and evaluated without context truncation. |
| Hardware Specification | Yes | We thank NVIDIA Corporation for donating several DGX machines, and IBM for donating a Minsky machine. |
| Software Dependencies | No | The paper mentions 'regular PyTorch code [56]' and that experiments were 'implemented in TorchBeast [67]', but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | In line with the small LSTM used for Atari (only 1 layer with 256 hidden nodes) we also configure a small RDN: 2 layers with a hidden size of 128 using 4 heads, and a feedforward dimension of 512. For the rest, we use the same hyperparameters as Espeholt et al. [65] which can be found in Appendix C. |
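As noted in the 'Pseudocode' row, the paper gives its method only as mathematical formulations. Below is a minimal sketch of the delta-rule fast weight update of Schlag et al. [23] that the paper's recurrent FWPs build on, written in plain PyTorch. The function name, tensor shapes, the softmax feature map, and the single-head, unbatched form are assumptions made for brevity; the recurrent connections that distinguish the paper's RFWPs (e.g., feeding the previous output back into the projections) are omitted, and this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def delta_fwp_step(W, x, Wk, Wv, Wq, Wb):
    """One hypothetical delta-rule fast weight step (single head, no batch).

    W : (d_val, d_key) fast weight matrix carried across time steps.
    x : (d_in,) current input; a recurrent FWP would also feed the previous
        output back into the projections below (omitted here).
    """
    k = F.softmax(Wk @ x, dim=-1)              # key (normalised feature map)
    v = Wv @ x                                 # value to be written
    q = F.softmax(Wq @ x, dim=-1)              # query for retrieval
    beta = torch.sigmoid(Wb @ x)               # scalar write strength

    v_old = W @ k                              # value currently stored at key k
    W = W + beta * torch.outer(v - v_old, k)   # delta-rule write
    y = W @ q                                  # read out the output
    return W, y

# Toy usage: run the step over a short random sequence.
d_in, d_key, d_val = 16, 8, 8
W = torch.zeros(d_val, d_key)
Wk, Wq = torch.randn(d_key, d_in), torch.randn(d_key, d_in)
Wv, Wb = torch.randn(d_val, d_in), torch.randn(1, d_in)
for x in torch.randn(10, d_in):
    W, y = delta_fwp_step(W, x, Wk, Wv, Wq, Wb)
```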
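For the 'Experiment Setup' row, the reported RDN hyperparameters for Atari can be restated as a small configuration sketch. The dictionary keys below are illustrative placeholders, not configuration names from the IDSIA/recurrent-fwp repository.

```python
# Hypothetical restatement of the small RDN reported for the Atari experiments.
rdn_atari_config = {
    "num_layers": 2,     # 2 RDN layers (vs. the 1-layer, 256-unit LSTM baseline)
    "hidden_size": 128,  # hidden dimension
    "num_heads": 4,      # 4 heads, i.e. 128 / 4 = 32 dimensions per head
    "ff_dim": 512,       # feedforward dimension
}
# All remaining training hyperparameters follow Espeholt et al. [65],
# as listed in the paper's Appendix C.
```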