Learning Associative Inference Using Fast Weight Memory
Authors: Imanol Schlag, Tsendsuren Munkhdalai, Jürgen Schmidhuber
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 demonstrates the generality of our method through experiments in the supervised, self-supervised, and meta-reinforcement learning setting. |
| Researcher Affiliation | Collaboration | Imanol Schlag, The Swiss AI Lab IDSIA / USI / SUPSI, imanol@idsia.ch; Tsendsuren Munkhdalai, Microsoft Research, tsendsuren.munkhdalai@microsoft.com; Jürgen Schmidhuber, The Swiss AI Lab IDSIA / USI / SUPSI, juergen@idsia.ch |
| Pseudocode | Yes | Listing 1: Python3 code to sample new environments such that any state is reachable by any other state. (A hedged sketch of such a sampling routine appears after the table.) |
| Open Source Code | Yes | Source code and data used in this paper is available at github.com/ischlag/Fast-Weight-Memory-public |
| Open Datasets | Yes | "Source code and data used in this paper is available at github.com/ischlag/Fast-Weight-Memory-public" and "Penn Treebank (PTB; Mikolov et al. (2010)) or WikiText-2 (WT2; Merity et al. (2017))" and "We provide the preprocessed catbAbI data together with our code so future work can compare using the same validation and test sequence." |
| Dataset Splits | Yes | "We used the same train/test/valid split of the data as in regular bAbI." and "Table 3: Statistics of the catbAbI dataset based on our preprocessing of the regular bAbI data." The reported statistics are: train ~5M tokens, 56,376 stories, 179,909 questions; valid ~560k tokens, 6,245 stories, 19,907 questions; test ~560k tokens, 6,247 stories, 19,910 questions. |
| Hardware Specification | Yes | "limited the amount of GPU memory to ~16GB for practical reasons." and "We thank NVIDIA Corporation for donating several DGX machines, and IBM for donating a Minsky machine." |
| Software Dependencies | No | The paper mentions software like Python3, PyTorch, Transformer-XL implementation, and Adam optimizer but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We truncate backpropagation through time (tBPTT) to 200 tokens for all models and limited the amount of GPU memory to ~16GB for practical reasons. For every model, we performed a hyperparameter search in QA mode over the first 3k steps, of which a smaller selection was trained for 30-60k steps. For example, for FWM: 'We set d_LSTM = 256, d_FWM = 32, N_r = 3 and experimented with two seeds for batch sizes 64, 128 and learning rates 0.0001, 0.00025, 0.0005, 0.001, 0.002.' (More details are provided in Sections F.1 to F.4.) A sketch of this search grid appears after the table. |
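
The Pseudocode row above only quotes the caption of the paper's Listing 1; the listing itself ships with the repository. As a rough illustration of what such a sampler must guarantee, the sketch below draws a random directed graph over integer states and rejection-samples until the graph is strongly connected, so every state is reachable from every other state. The function names (`sample_environment`, `_strongly_connected`) and the rejection-sampling strategy are assumptions for illustration, not the paper's actual Listing 1.

```python
import random


def sample_environment(num_states: int, num_edges: int, rng: random.Random):
    """Sample a random directed graph and resample until it is strongly
    connected, i.e. every state is reachable from every other state.
    (Illustrative sketch; not the paper's Listing 1.)"""
    # Strong connectivity needs at least one outgoing and one incoming
    # edge per state, so fewer than num_states edges can never succeed.
    assert num_states <= num_edges <= num_states * (num_states - 1)
    while True:
        edges = set()
        while len(edges) < num_edges:
            s, t = rng.randrange(num_states), rng.randrange(num_states)
            if s != t:
                edges.add((s, t))
        if _strongly_connected(num_states, edges):
            return sorted(edges)


def _reachable(num_states, adjacency, start):
    """Return True if every state is reachable from `start` in `adjacency`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == num_states


def _strongly_connected(num_states, edges):
    """Check strong connectivity via forward and reverse reachability from state 0."""
    adjacency = {s: [] for s in range(num_states)}
    reverse = {s: [] for s in range(num_states)}
    for s, t in edges:
        adjacency[s].append(t)
        reverse[t].append(s)
    return _reachable(num_states, adjacency, 0) and _reachable(num_states, reverse, 0)
```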
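
The Experiment Setup row quotes a small grid search with fixed model sizes (d_LSTM = 256, d_FWM = 32, N_r = 3) and tBPTT of 200 tokens. A minimal sketch of how that grid could be enumerated is shown below; the config field names and the `train_and_evaluate` stub are hypothetical, while the searched values are taken verbatim from the quoted text.

```python
from itertools import product

# Fixed model settings quoted from the paper's FWM setup.
BASE_CONFIG = {
    "d_lstm": 256,      # LSTM hidden size (d_LSTM)
    "d_fwm": 32,        # fast weight memory dimension (d_FWM)
    "n_r": 3,           # number of reads (N_r)
    "tbptt_len": 200,   # truncated BPTT length in tokens
}

# Searched values quoted from the paper; two seeds per combination.
BATCH_SIZES = [64, 128]
LEARNING_RATES = [0.0001, 0.00025, 0.0005, 0.001, 0.002]
SEEDS = [0, 1]


def run_search(train_and_evaluate):
    """Enumerate the full grid and hand each config to a user-supplied
    training function (hypothetical interface, not the paper's code)."""
    results = {}
    for batch_size, lr, seed in product(BATCH_SIZES, LEARNING_RATES, SEEDS):
        config = dict(BASE_CONFIG, batch_size=batch_size, lr=lr, seed=seed)
        results[(batch_size, lr, seed)] = train_and_evaluate(config)
    return results
```

With two seeds, the quoted grid amounts to 2 × 5 × 2 = 20 short runs, consistent with the description of a 3k-step search followed by longer training of a smaller selection.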