Learning Hierarchical Information Flow with Recurrent Neural Modules
Authors: Danijar Hafner, Alex Irpan, James Davidson, Nicolas Heess
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our model learns to route information hierarchically, processing input data by a chain of modules. We observe common architectures, such as feed forward neural networks and skip connections, emerging as special cases of our architecture, while novel connectivity patterns are learned for the text8 compression task. Our model outperforms standard recurrent neural networks on several sequential benchmarks. |
| Researcher Affiliation | Industry | Danijar Hafner (Google Brain, mail@danijar.com); Alex Irpan (Google Brain, alexirpan@google.com); James Davidson (Google Brain, jcdavidson@google.com); Nicolas Heess (Google DeepMind, heess@google.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link, an explicit code release statement, or code in supplementary materials, for the methodology described. |
| Open Datasets | Yes | Sequential Permuted MNIST. We use images from the MNIST [19] data set, permute the pixels of every image by a fixed random permutation, and show them to the model as a sequence of rows. Sequential CIFAR-10. In a similar spirit, we use the CIFAR-10 [17] data set and feed images to the model row by row. Text8 Language Modeling. This text corpus, consisting of the first 10^8 bytes of the English Wikipedia, is commonly used as a language modeling benchmark for sequential models. |
| Dataset Splits | Yes | We use the standard split of 60,000 training images and 10,000 testing images. The data set contains 50,000 training images and 10,000 testing images. Following Cooijmans et al. [4], we train on the first 90% and evaluate performance on the following 5% of the corpus. |
| Hardware Specification | No | The paper mentions training durations (e.g., 'The training took about 8 days') but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions optimizers like RMSProp and Adam, but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | For all models, we pick the largest layer sizes such that the number of parameters does not exceed 50,000. Training is performed for 100 epochs on batches of size 50 using RMSProp [30] with a learning rate of 10^-3. We train on batches of 100 sequences, each containing 200 bytes, using the Adam optimizer [15] with a default learning rate of 10^-3. We scale down gradients exceeding a norm of 1. Models are trained for 50 epochs on batches of size 10 containing sequences of length 50 using RMSProp with a learning rate of 10^-3. |
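
To make the data handling quoted in the Open Datasets and Dataset Splits rows concrete, here is a minimal NumPy sketch. It is not the authors' code; the function names, the permutation seed, and the array shapes are assumptions. It only mirrors the quoted description: every MNIST image is permuted by one fixed random pixel permutation and presented as a sequence of rows, and text8 is split into the first 90% for training and the following 5% for evaluation.

```python
import numpy as np

def permuted_mnist_sequence(images, seed=0):
    """Apply one fixed random permutation to every image's pixels, then
    present each image as a sequence of 28 rows with 28 values per step.
    `images` is assumed to have shape (num_images, 28, 28)."""
    rng = np.random.RandomState(seed)          # seed is an illustrative choice
    perm = rng.permutation(28 * 28)            # one permutation shared by all images
    flat = images.reshape(len(images), -1)[:, perm]
    return flat.reshape(len(images), 28, 28)

def text8_splits(corpus_bytes):
    """Train on the first 90% of text8 and evaluate on the following 5%,
    as in the split attributed to Cooijmans et al. [4] above."""
    n = len(corpus_bytes)
    train = corpus_bytes[: int(0.9 * n)]
    evaluation = corpus_bytes[int(0.9 * n): int(0.95 * n)]
    return train, evaluation
```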
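
Similarly, the Experiment Setup row can be read as the configuration sketched below. The dictionary keys and the `clip_by_global_norm` helper are illustrative, not the paper's implementation; the values mirror the quoted hyperparameters (a 50,000-parameter budget, RMSProp or Adam with learning rate 10^-3, and scaling down gradients whose norm exceeds 1).

```python
import numpy as np

# Hypothetical hyperparameter summaries assembled from the quoted setup.
MNIST_CIFAR_CONFIG = dict(optimizer="RMSProp", learning_rate=1e-3,
                          epochs=100, batch_size=50, max_parameters=50_000)
TEXT8_CONFIG = dict(optimizer="Adam", learning_rate=1e-3,
                    batch_size=100, sequence_length=200, clip_norm=1.0)

def clip_by_global_norm(gradients, clip_norm=1.0):
    """Scale all gradients down jointly when their global norm exceeds clip_norm
    ("we scale down gradients exceeding a norm of 1")."""
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in gradients))
    if global_norm > clip_norm:
        gradients = [g * (clip_norm / global_norm) for g in gradients]
    return gradients
```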