A Modern Self-Referential Weight Matrix That Learns to Modify Itself
Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our SRWM in supervised few-shot learning and in multi-task reinforcement learning with procedurally generated game environments. Our experiments demonstrate both practical applicability and competitive performance of the proposed SRWM. Our code is public. |
| Researcher Affiliation | Collaboration | 1The Swiss AI Lab, IDSIA, USI & SUPSI, Lugano, Switzerland 2AI Initiative, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. Correspondence to: <{kazuki, imanol, robert, juergen}@idsia.ch>. |
| Pseudocode | No | The paper describes the model dynamics using mathematical equations (Eqs. 1-8) and figures, but does not provide pseudocode or algorithm blocks. (A hedged sketch of the update equations appears below the table.) |
| Open Source Code | Yes | Our code is public. https://github.com/IDSIA/modern-srwm |
| Open Datasets | Yes | We conduct experiments on the classic Omniglot (Lake et al., 2015) and Mini-ImageNet (Vinyals et al., 2016; Ravi & Larochelle, 2017) datasets. For further details on the datasets, we refer to the respective references and Appendix B where we also provide extra experimental results on the Fewshot-CIFAR100 (Oreshkin et al., 2018) dataset. ... We use torchmeta by Deleu et al. (2019) which implements all common settings used with these datasets. (A torchmeta loading example appears below the table.) |
| Dataset Splits | Yes | For each dataset, classes are split into train, validation and test for few-shot learning settings. Omniglot: 1028/172/432-split for the train/validation/test set. Mini-ImageNet: standard class train/valid/test splits of 64/16/20 are used (Ravi & Larochelle, 2017). FC100: 100 color image classes (600 images per class, each of size 32×32) are split into train/valid/test classes of 60/20/20 (Oreshkin et al., 2018). |
| Hardware Specification | Yes | Regarding speed, the feedforward and LSTM baselines process about 3,500 steps per second, while Delta Net and SRWM do 2,300 and 1,700 steps per second respectively on a single P100 GPU in the RL experiments which require slow state copying due to separate interaction and training modes. ... With a batch size of 128, they process about 8,000 images per second, using the same Conv-4 backend on 1-shot Omniglot on a single P100 GPU. |
| Software Dependencies | No | The paper mentions "Torchbeast (Küttler et al., 2019)" and "PyTorch" but does not specify version numbers for these or other software libraries. |
| Experiment Setup | Yes | For Omniglot, we use two layers of size 256 with 16 computational heads and a 1024 (4 × 256) dimensional feed-forward inner dimension. We train with a learning rate of 1e-3 and a batch size of 128 for 300K steps, validating every 1000 steps. For Mini-ImageNet, we conduct a hyper-parameter search for the SRWM and the Delta Net as follows: a number of layers l ∈ {2, 3, 4}, a hidden size d_model ∈ {128, 256}, two dropout rates p_vision, p ∈ {0.0, 0.1, 0.2, 0.3} (separately for the vision and sequence processing components), and a learning rate η ∈ {1e-3, 3e-4, 1e-4} or the standard Transformer warmup learning rate scheduling (Vaswani et al., 2017). The number of heads is fixed to 16. We set the feed-forward inner dimension to d_ff = m · d_model where m ∈ {4, 8}. (The warmup schedule is sketched below the table.) |
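The Pseudocode row notes that the model dynamics are given only as equations. As a reading aid, below is a minimal single-head PyTorch sketch of the self-referential delta-rule update those equations describe: the matrix `W` maps its input to an output, a self-generated key, query, and learning rate, then rewrites itself with an outer-product delta update. The block layout of `W`, the slicing order, and exactly where the softmax φ is applied are assumptions made for illustration; this is not the authors' code (see the repository linked above).

```python
import torch

def srwm_step(W: torch.Tensor, x: torch.Tensor, d_out: int):
    """One self-referential step, sketching Eqs. 4-7 of Irie et al. (2022).

    W: (d_out + 2*d_in + 1, d_in) -- a single-head fast weight matrix whose
    rows are assumed to be stacked as [output | key | query | learning rate].
    Only the initial W_0 is trained by gradient descent; within a sequence,
    all weight changes come from this self-update.
    """
    d_in = x.shape[0]
    out = W @ x                                  # one matmul yields all four parts
    y = out[:d_out]                              # network output y_t
    k = out[d_out:d_out + d_in]                  # self-generated key k_t
    q = out[d_out + d_in:d_out + 2 * d_in]       # self-generated query q_t
    beta = out[-1]                               # self-generated learning rate beta_t

    phi_k = torch.softmax(k, dim=0)              # phi = softmax, as in the paper
    phi_q = torch.softmax(q, dim=0)
    v = W @ phi_q                                # "new" value retrieved via the query
    v_bar = W @ phi_k                            # value currently stored at the key
    # Delta rule: overwrite the key's stored value, gated by sigmoid(beta).
    W_new = W + torch.sigmoid(beta) * torch.outer(v - v_bar, phi_k)
    return y, W_new
```

The full model stacks several such layers with multiple heads inside a Transformer-style block; this sketch omits both.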
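The Open Datasets and Dataset Splits rows rely on torchmeta (Deleu et al., 2019) for the standard class splits. Below is a short loading example assuming torchmeta's documented helper API; the 5-way 1-shot setting and the batch size are illustrative choices, not the paper's configuration.

```python
from torchmeta.datasets.helpers import omniglot
from torchmeta.utils.data import BatchMetaDataLoader

# 5-way 1-shot episodes sampled from the meta-train class split
# (the 1028 training classes in the 1028/172/432 Omniglot split).
dataset = omniglot("data", ways=5, shots=1, test_shots=15,
                   meta_train=True, download=True)
loader = BatchMetaDataLoader(dataset, batch_size=16, num_workers=2)

for batch in loader:
    support_x, support_y = batch["train"]  # support set: (16, 5, 1, 28, 28) images
    query_x, query_y = batch["test"]       # query set:   (16, 75, 1, 28, 28) images
    break
```

Swapping `omniglot` for torchmeta's `miniimagenet` helper yields the 64/16/20 Mini-ImageNet class split the same way.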
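The Experiment Setup row lists "the standard Transformer warmup learning rate scheduling (Vaswani et al., 2017)" as one option in the search. That schedule is lr(step) = d_model^(-1/2) · min(step^(-1/2), step · warmup^(-3/2)); here is a minimal PyTorch rendering via `LambdaLR`. The warmup length of 4000 steps is Vaswani et al.'s default, not a value reported in this paper.

```python
import torch

def transformer_schedule(d_model: int, warmup: int = 4000):
    """lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    def lr_lambda(step: int) -> float:
        step = max(step, 1)  # avoid 0^-0.5 on the first call
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    return lr_lambda

model = torch.nn.Linear(256, 256)  # stand-in module; d_model = 256 as in the Omniglot setup
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)  # base lr 1.0 so the lambda is the lr
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_schedule(256))
# call scheduler.step() once per training step to apply the schedule
```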