Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our novel recurrent FWPs (RFWPs) on two synthetic algorithmic tasks (code execution and sequential ListOps), Wikitext-103 language models, and on the Atari 2600 2D game environment. Our models exhibit properties of Transformers and RNNs. In the reinforcement learning setting, we report large improvements over LSTM in several Atari games.
Researcher Affiliation | Academia | (1) The Swiss AI Lab, IDSIA, University of Lugano (USI) & SUPSI, Lugano, Switzerland; (2) King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Pseudocode | No | The paper describes mathematical formulations and architectural modifications but does not include a dedicated pseudocode or algorithm block.
Open Source Code | Yes | Our code is public: https://github.com/IDSIA/recurrent-fwp
Open Datasets | Yes | We use the Wikitext-103 dataset [28] and follow the small model setting similar to what's used in recent works by Peng et al. [21] and Schlag et al. [23].
Dataset Splits | Yes | All models are trained and evaluated on the span of 256 tokens except for the models in the last two rows (+ full context) which are trained and evaluated without context truncation.
Hardware Specification | Yes | We thank NVIDIA Corporation for donating several DGX machines, and IBM for donating a Minsky machine.
Software Dependencies | No | The paper mentions 'regular PyTorch code [56]' and that experiments were 'implemented in Torchbeast [67]', but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | In line with the small LSTM used for Atari (only 1 layer with 256 hidden nodes) we also configure a small RDN: 2 layers with a hidden size of 128 using 4 heads, and a feedforward dimension of 512. For the rest, we use the same hyperparameters as Espeholt et al. [65] which can be found in Appendix C.
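
To make the Experiment Setup row concrete, below is a minimal configuration sketch that collects the reported Atari hyperparameters (a 1-layer, 256-unit LSTM baseline and a 2-layer RDN with hidden size 128, 4 heads, and feedforward dimension 512) into plain Python dataclasses. The class and field names are hypothetical illustrations and are not taken from the authors' repository; the paper states that all remaining hyperparameters follow Espeholt et al. [65].

```python
# Hypothetical configuration sketch for the Atari setup described above.
# Class and field names are illustrative only; they do not correspond to the
# authors' actual code at https://github.com/IDSIA/recurrent-fwp.
from dataclasses import dataclass


@dataclass
class LSTMBaselineConfig:
    """Small LSTM baseline reported for Atari: 1 layer with 256 hidden nodes."""
    num_layers: int = 1
    hidden_size: int = 256


@dataclass
class RDNConfig:
    """Small Recurrent Delta Net (RDN) reported for Atari."""
    num_layers: int = 2      # 2 layers
    hidden_size: int = 128   # hidden size of 128
    num_heads: int = 4       # 4 heads
    ff_dim: int = 512        # feedforward dimension of 512
    # All other hyperparameters are stated to follow Espeholt et al. [65]
    # (see Appendix C of the paper).


if __name__ == "__main__":
    print(LSTMBaselineConfig())
    print(RDNConfig())
```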