Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our novel recurrent FWPs (RFWPs) on two synthetic algorithmic tasks (code execution and sequential ListOps), Wikitext-103 language models, and on the Atari 2600 2D game environment. Our models exhibit properties of Transformers and RNNs. In the reinforcement learning setting, we report large improvements over LSTM in several Atari games. |
| Researcher Affiliation | Academia | (1) The Swiss AI Lab, IDSIA, University of Lugano (USI) & SUPSI, Lugano, Switzerland; (2) King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia |
| Pseudocode | No | The paper describes mathematical formulations and architectural modifications but does not include a dedicated pseudocode or algorithm block (a hedged sketch of the underlying update rule is given after this table). |
| Open Source Code | Yes | Our code is public: https://github.com/IDSIA/recurrent-fwp |
| Open Datasets | Yes | We use the Wikitext-103 dataset [28] and follow the small model setting similar to what's used in recent works by Peng et al. [21] and Schlag et al. [23]. |
| Dataset Splits | Yes | All models are trained and evaluated on the span of 256 tokens except for the models in the last two rows (+ full context) which are trained and evaluated without context truncation. |
| Hardware Specification | Yes | We thank NVIDIA Corporation for donating several DGX machines, and IBM for donating a Minsky machine. |
| Software Dependencies | No | The paper mentions 'regular PyTorch code [56]' and that experiments were 'implemented in TorchBeast [67]', but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | In line with the small LSTM used for Atari (only 1 layer with 256 hidden nodes) we also configure a small RDN: 2 layers with a hidden size of 128 using 4 heads, and a feedforward dimension of 512. For the rest, we use the same hyperparameters as Espeholt et al. [65] which can be found in Appendix C. |
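As noted in the 'Pseudocode' row, the paper gives its method only as mathematical formulations. Below is a minimal sketch of the delta-rule fast weight update of Schlag et al. [23] that the paper's recurrent FWPs build on, written in plain PyTorch. The function name, tensor shapes, the softmax feature map, and the single-head, unbatched form are assumptions made for brevity; the recurrent connections that distinguish the paper's RFWPs (e.g., feeding the previous output back into the projections) are omitted, and this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def delta_fwp_step(W, x, Wk, Wv, Wq, Wb):
    """One hypothetical delta-rule fast weight step (single head, no batch).

    W : (d_val, d_key) fast weight matrix carried across time steps.
    x : (d_in,) current input; a recurrent FWP would also feed the previous
        output back into the projections below (omitted here).
    """
    k = F.softmax(Wk @ x, dim=-1)              # key (normalised feature map)
    v = Wv @ x                                 # value to be written
    q = F.softmax(Wq @ x, dim=-1)              # query for retrieval
    beta = torch.sigmoid(Wb @ x)               # scalar write strength

    v_old = W @ k                              # value currently stored at key k
    W = W + beta * torch.outer(v - v_old, k)   # delta-rule write
    y = W @ q                                  # read out the output
    return W, y

# Toy usage: run the step over a short random sequence.
d_in, d_key, d_val = 16, 8, 8
W = torch.zeros(d_val, d_key)
Wk, Wq = torch.randn(d_key, d_in), torch.randn(d_key, d_in)
Wv, Wb = torch.randn(d_val, d_in), torch.randn(1, d_in)
for x in torch.randn(10, d_in):
    W, y = delta_fwp_step(W, x, Wk, Wv, Wq, Wb)
```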
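For the 'Experiment Setup' row, the reported RDN hyperparameters for Atari can be restated as a small configuration sketch. The dictionary keys below are illustrative placeholders, not configuration names from the IDSIA/recurrent-fwp repository.

```python
# Hypothetical restatement of the small RDN reported for the Atari experiments.
rdn_atari_config = {
    "num_layers": 2,     # 2 RDN layers (vs. the 1-layer, 256-unit LSTM baseline)
    "hidden_size": 128,  # hidden dimension
    "num_heads": 4,      # 4 heads, i.e. 128 / 4 = 32 dimensions per head
    "ff_dim": 512,       # feedforward dimension
}
# All remaining training hyperparameters follow Espeholt et al. [65],
# as listed in the paper's Appendix C.
```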