A Modern Self-Referential Weight Matrix That Learns to Modify Itself

Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our SRWM in supervised few-shot learning and in multi-task reinforcement learning with procedurally generated game environments. Our experiments demonstrate both practical applicability and competitive performance of the proposed SRWM. Our code is public.
Researcher Affiliation | Collaboration | ¹The Swiss AI Lab, IDSIA, USI & SUPSI, Lugano, Switzerland; ²AI Initiative, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. Correspondence to: <{kazuki, imanol, robert, juergen}@idsia.ch>.
Pseudocode | No | The paper describes the model dynamics using mathematical equations (Eqs. 1-8) and figures, but does not provide pseudocode or algorithm blocks. (A hedged sketch of one update step, reconstructed from those equations, follows this table.)
Open Source Code | Yes | Our code is public. https://github.com/IDSIA/modern-srwm
Open Datasets | Yes | We conduct experiments on the classic Omniglot (Lake et al., 2015) and Mini-ImageNet (Vinyals et al., 2016; Ravi & Larochelle, 2017) datasets. For further details on the datasets, we refer to the respective references and Appendix B, where we also provide extra experimental results on the Fewshot-CIFAR100 (Oreshkin et al., 2018) dataset. ... We use torchmeta by Deleu et al. (2019), which implements all common settings used with these datasets. (A loading sketch using torchmeta follows this table.)
Dataset Splits | Yes | For each dataset, classes are split into train, validation and test for few-shot learning settings. Omniglot: 1028/172/423 split for the train/validation/test sets. Mini-ImageNet: standard class train/valid/test splits of 64/16/20 are used (Ravi & Larochelle, 2017). FC100: 100 color image classes (600 images per class, each of size 32×32) are split into train/valid/test classes of 60/20/20 (Oreshkin et al., 2018).
Hardware Specification | Yes | Regarding speed, the feedforward and LSTM baselines process about 3,500 steps per second, while the Delta Net and SRWM process 2,300 and 1,700 steps per second respectively on a single P100 GPU in the RL experiments, which require slow state copying due to separate interaction and training modes. ... With a batch size of 128, they process about 8,000 images per second, using the same Conv-4 backend on 1-shot Omniglot on a single P100 GPU.
Software Dependencies | No | The paper mentions "Torchbeast (Küttler et al., 2019)" and "PyTorch" but does not specify version numbers for these or other software libraries.
Experiment Setup | Yes | For Omniglot, we use two layers of size 256 with 16 computational heads and a 1024-dimensional (4 × 256) feed-forward inner dimension. We train with a learning rate of 1e-3 and a batch size of 128 for 300K steps, validating every 1000 steps. For Mini-ImageNet, we conduct a hyper-parameter search for the SRWM and the Delta Net as follows: number of layers l ∈ {2, 3, 4}, hidden size d_model ∈ {128, 256}, two dropout rates p_vision, p ∈ {0.0, 0.1, 0.2, 0.3} (set separately for the vision and sequence-processing components), and learning rate η ∈ {1e-3, 3e-4, 1e-4} or the standard Transformer warmup learning-rate schedule (Vaswani et al., 2017). The number of heads is fixed to 16. We set the feed-forward inner dimension to d_ff = m · d_model with m ∈ {4, 8}. (These settings are restated as a config sketch after this table.)
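
Because the paper specifies the model only through its equations, the following PyTorch sketch reconstructs one self-referential update step from the delta-rule dynamics it describes: the weight matrix generates its own output, key, query, and learning rate, then applies a rank-one correction to itself. This is a minimal single-head, batch-free illustration under our own assumptions about shapes and naming (`srwm_step`, `d_in`, `d_out` are ours), not the authors' implementation; see https://github.com/IDSIA/modern-srwm for the real code.

```python
import torch
import torch.nn.functional as F

def srwm_step(W, x):
    """One self-referential update step (single head, batch-free sketch).

    W: (d_out + 2 * d_in + 1, d_in) matrix producing, from one input,
       the output y, a key k, a query q, and a learning rate beta.
    x: (d_in,) input vector.
    Returns the output y and the self-modified weight matrix.
    """
    d_in = x.shape[0]
    d_out = W.shape[0] - 2 * d_in - 1

    # The matrix generates its own output, key, query, and learning rate.
    out = W @ F.softmax(x, dim=0)
    y, k, q, beta = torch.split(out, [d_out, d_in, d_in, 1])

    # Value currently stored under the key, and the new self-generated
    # target value retrieved via the query (delta rule).
    phi_k = F.softmax(k, dim=0)
    v_bar = W @ phi_k               # current value associated with k
    v = W @ F.softmax(q, dim=0)     # self-generated target value

    # Rank-one self-modification of the entire weight matrix.
    W_new = W + torch.sigmoid(beta) * torch.outer(v - v_bar, phi_k)
    return y, W_new

if __name__ == "__main__":
    d_in, d_out = 32, 16
    W = torch.zeros(d_out + 2 * d_in + 1, d_in)  # placeholder for W_0
    y, W = srwm_step(W, torch.randn(d_in))
    print(y.shape, W.shape)
```

A zero-initialized W here is only a placeholder: in the paper, the initial weight matrix W_0 is itself trained by gradient descent, while the rank-one self-modifications happen within each input sequence.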
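The split sizes quoted in the Dataset Splits row are the ones implemented in torchmeta. As a concrete illustration, a 5-way 1-shot Omniglot episode loader can be built with torchmeta's helper API as below; the episode parameters (ways, shots, batch size) are illustrative choices, not values taken from the paper.

```python
from torchmeta.datasets.helpers import omniglot
from torchmeta.utils.data import BatchMetaDataLoader

# Episodes are sampled from the 1028 meta-training classes;
# meta_val=True / meta_test=True select the 172- and 423-class splits instead.
dataset = omniglot("data", ways=5, shots=1, test_shots=15,
                   meta_train=True, download=True)
loader = BatchMetaDataLoader(dataset, batch_size=16, num_workers=2)

for batch in loader:
    support_x, support_y = batch["train"]  # (16, 5, 1, 28, 28), (16, 5)
    query_x, query_y = batch["test"]       # 15 query images per class
    break
```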
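Finally, the Experiment Setup row can be restated as a plain configuration sketch. This only collects the quoted hyper-parameters in one place; the dictionary key names are ours, not identifiers from the released code.

```python
# Omniglot SRWM hyper-parameters as quoted from the paper; key names are ours.
omniglot_config = {
    "num_layers": 2,
    "d_model": 256,
    "num_heads": 16,
    "d_ff": 4 * 256,          # 1024-dimensional feed-forward inner size
    "learning_rate": 1e-3,
    "batch_size": 128,
    "train_steps": 300_000,
    "validate_every": 1_000,
}

# Mini-ImageNet search space (best values chosen on the validation split).
miniimagenet_search = {
    "num_layers": [2, 3, 4],
    "d_model": [128, 256],
    "p_vision": [0.0, 0.1, 0.2, 0.3],   # dropout, vision backbone
    "p_seq": [0.0, 0.1, 0.2, 0.3],      # dropout, sequence processor
    "learning_rate": [1e-3, 3e-4, 1e-4, "transformer_warmup"],
    "num_heads": [16],
    "ff_multiplier": [4, 8],            # d_ff = m * d_model
}
```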