Learning Associative Inference Using Fast Weight Memory

Authors: Imanol Schlag, Tsendsuren Munkhdalai, Jürgen Schmidhuber

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Section 4 demonstrates the generality of our method through experiments in the supervised, self-supervised, and meta-reinforcement learning setting."
Researcher Affiliation | Collaboration | Imanol Schlag, The Swiss AI Lab IDSIA / USI / SUPSI, imanol@idsia.ch; Tsendsuren Munkhdalai, Microsoft Research, tsendsuren.munkhdalai@microsoft.com; Jürgen Schmidhuber, The Swiss AI Lab IDSIA / USI / SUPSI, juergen@idsia.ch
Pseudocode | Yes | "Listing 1: Python3 code to sample new environments such that any state is reachable by any other state." (An illustrative sketch of such a sampler follows the table.)
Open Source Code | Yes | "Source code and data used in this paper is available at github.com/ischlag/Fast-Weight-Memory-public"
Open Datasets | Yes | "Source code and data used in this paper is available at github.com/ischlag/Fast-Weight-Memory-public" and "Penn Treebank (PTB; Mikolov et al. (2010)) or WikiText-2 (WT2; Merity et al. (2017))" and "We provide the preprocessed catbAbI data together with our code so future work can compare using the same validation and test sequence."
Dataset Splits | Yes | "We used the same train/test/valid split of the data as in regular bAbI." and "Table 3: Statistics of the catbAbI dataset based on our preprocessing of the regular bAbI data." (train: ~5M tokens, 56,376 stories, 179,909 questions; valid: ~560k tokens, 6,245 stories, 19,907 questions; test: ~560k tokens, 6,247 stories, 19,910 questions)
Hardware Specification | Yes | "limited the amount of GPU memory to ~16GB for practical reasons." and "We thank NVIDIA Corporation for donating several DGX machines, and IBM for donating a Minsky machine."
Software Dependencies | No | The paper mentions software such as Python3, PyTorch, a Transformer-XL implementation, and the Adam optimizer, but does not provide specific version numbers for these components.
Experiment Setup | Yes | "We truncate backpropagation through time (tBPTT) to 200 tokens for all models and limited the amount of GPU memory to ~16GB for practical reasons." and "For every model, we performed a hyperparameter search in QA mode over the first 3k steps of which a smaller selection was trained for 30-60k steps." For example, for FWM: "We set d_LSTM = 256, d_FWM = 32, N_r = 3 and experimented with two seeds for batch sizes 64, 128 and learning rates 0.0001, 0.00025, 0.0005, 0.001, 0.002." More details are provided in Sections F.1 to F.4. (A sketch of this sweep follows the table.)
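
The paper's Listing 1 is not reproduced on this page. As an illustration only, the following is a minimal Python3 sketch of one way to sample environments in which every state is reachable from every other state, assuming the environment can be represented as a directed graph over discrete states; the function name, arguments, and graph representation are assumptions, and the paper's actual listing may differ.

import random

def sample_environment(num_states, extra_edges, rng):
    """Sample a directed graph over num_states states that is strongly
    connected, i.e. every state is reachable from every other state.
    Illustrative strategy only: connect all states along a random cycle
    (which already guarantees mutual reachability), then add extra
    random edges for variety."""
    order = list(range(num_states))
    rng.shuffle(order)

    # A cycle s0 -> s1 -> ... -> s_{n-1} -> s0 makes every state reachable.
    edges = set()
    for i in range(num_states):
        edges.add((order[i], order[(i + 1) % num_states]))

    # Add further random edges, avoiding self-loops and duplicates; the
    # min(...) bound prevents asking for more edges than can exist.
    max_edges = num_states * (num_states - 1)
    while len(edges) < min(num_states + extra_edges, max_edges):
        src, dst = rng.sample(range(num_states), 2)
        edges.add((src, dst))

    # Return an adjacency-list view of the sampled environment.
    adjacency = {s: [] for s in range(num_states)}
    for src, dst in sorted(edges):
        adjacency[src].append(dst)
    return adjacency

# Example: one sampled environment with 8 states.
env = sample_environment(num_states=8, extra_edges=8, rng=random.Random(0))
print(env)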
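
The FWM hyperparameter search quoted in the Experiment Setup row amounts to a small grid. The Python3 sketch below simply enumerates that grid; the helper name build_configs, the dictionary keys, and the concrete seed values are hypothetical, while the numeric settings are taken from the quote.

from itertools import product

# Fixed settings quoted for FWM on catbAbI; the key names are assumptions.
FIXED = {"d_lstm": 256, "d_fwm": 32, "n_r": 3, "tbptt_len": 200}
BATCH_SIZES = [64, 128]
LEARNING_RATES = [0.0001, 0.00025, 0.0005, 0.001, 0.002]
SEEDS = [0, 1]  # "two seeds"; the actual seed values are not reported

def build_configs():
    """Enumerate the 2 x 5 x 2 = 20 runs implied by the quoted search."""
    for batch_size, lr, seed in product(BATCH_SIZES, LEARNING_RATES, SEEDS):
        yield dict(FIXED, batch_size=batch_size, learning_rate=lr, seed=seed)

for cfg in build_configs():
    print(cfg)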