Can recurrent neural networks warp time?
Authors: Corentin Tallec, Yann Ollivier
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, this new chrono initialization is shown to greatly improve learning of long term dependencies, with minimal implementation effort. ... We test the empirical benefits of the new initialization on both synthetic and real world data (Section 3). ... For synthetic tasks, optimization is performed using RMSprop (Tieleman & Hinton, 2012) with a learning rate of 10⁻³ and a moving average parameter of 0.9. No gradient clipping is performed; this results in a few short-lived spikes in the plots below, which do not affect final performance. |
| Researcher Affiliation | Collaboration | Corentin Tallec Laboratoire de Recherche en Informatique Université Paris Sud Gif-sur-Yvette, 91190, France corentin.tallec@u-psud.fr Yann Ollivier Facebook Artificial Intelligence Research Paris, France yol@fb.com |
| Pseudocode | No | The paper provides mathematical equations but no pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | In addition (Appendix A), we test the chrono initialization on next character prediction on the Text8 (Mahoney, 2011) dataset, and on next word prediction on the Penn Treebank dataset (Mikolov et al., 2012). |
| Dataset Splits | Yes | For each value of maximum_warping, the train dataset consists of 50,000 length-500 randomly warped random sequences, with either uniform or variable time warpings. ... Test datasets of 10,000 sequences are generated similarly. ... (Mikolov et al., 2012)'s train-valid-test split is used: the first 90M characters are used as training set, the next 5M as validation set and the last 5M as test set. (A sketch of this split appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only mentions 'single layer LSTMs' and 'RHN' networks. |
| Software Dependencies | No | The paper mentions software components like 'RMSprop', 'Adam', and 'recurrent batch normalization' but does not provide specific version numbers for any of these, or for general programming languages or libraries. |
| Experiment Setup | Yes | RMSprop with an α parameter of 0.9 and a batch size of 32 is used. For faster convergence, learning rates are divided by 2 each time the evaluation loss has not decreased after 100 batches. ... For synthetic tasks, optimization is performed using RMSprop (Tieleman & Hinton, 2012) with a learning rate of 10⁻³ and a moving average parameter of 0.9. No gradient clipping is performed... For the standard initialization (baseline), the forget gate biases are set to 1. For the new initialization, the forget gate and input gate biases are chosen according to the chrono initialization (16)... Adam (Kingma & Ba, 2014) with learning rate 10⁻³, batches of size 128 made of non-overlapping sequences of length 180, and gradient clipping at 1.0. (Sketches of the chrono initialization and the optimizer settings appear after the table.) |
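
The "chrono initialization (16)" cited in the Experiment Setup row sets the forget-gate bias to log(u) with u drawn uniformly from [1, T_max − 1] and the input-gate bias to its opposite, where T_max is the longest dependency range the network is expected to capture. The paper provides no code, so the following is a minimal sketch assuming PyTorch; the helper name `chrono_init`, the layer sizes, and the T_max value are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

def chrono_init(lstm: nn.LSTM, t_max: float) -> None:
    """Chrono initialization (Tallec & Ollivier, 2018): forget-gate bias
    b_f = log(u) with u ~ U([1, t_max - 1]), input-gate bias b_i = -b_f."""
    h = lstm.hidden_size
    for layer in range(lstm.num_layers):
        bias_ih = getattr(lstm, f"bias_ih_l{layer}")
        bias_hh = getattr(lstm, f"bias_hh_l{layer}")
        with torch.no_grad():
            bias_ih.zero_()
            bias_hh.zero_()
            # PyTorch packs LSTM biases as [input, forget, cell, output] blocks of size h.
            b_f = torch.log(torch.empty(h).uniform_(1.0, t_max - 1.0))
            bias_ih[h:2 * h] = b_f   # forget gate
            bias_ih[0:h] = -b_f      # input gate

# Example: single-layer LSTM for sequences with dependencies up to ~500 steps.
lstm = nn.LSTM(input_size=32, hidden_size=128, num_layers=1)
chrono_init(lstm, t_max=500.0)
```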
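
The quoted optimizer settings for the real-world task (RMSprop, learning rate 10⁻³, moving-average parameter 0.9, batch size 32, learning rate halved after 100 batches without improvement of the evaluation loss) could be wired up roughly as follows. This is a sketch under PyTorch assumptions: `model` is a placeholder, and `ReduceLROnPlateau` is a stand-in for the paper's halving rule rather than the authors' actual implementation.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=128)  # placeholder architecture

# RMSprop with learning rate 1e-3 and moving-average parameter (alpha) 0.9,
# as reported in the paper; the batch size of 32 is set in the data loader.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9)

# Halve the learning rate when the evaluation loss stops decreasing.
# `patience` counts scheduler.step(eval_loss) calls, so calling it once per
# batch approximates "divided by 2 ... after 100 batches" from the paper.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=100
)

# Inside the training loop (illustrative):
#     loss.backward()
#     torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # only for the Adam/PTB setup
#     optimizer.step(); optimizer.zero_grad()
#     scheduler.step(eval_loss)
```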
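
The Text8 split quoted in the Dataset Splits row (first 90M characters for training, next 5M for validation, last 5M for testing) is straightforward to reproduce; in the sketch below the file path is an assumption.

```python
# Standard Text8 split (Mikolov et al., 2012): 90M / 5M / 5M characters.
with open("text8", "r") as f:  # path to the unzipped Text8 file is illustrative
    data = f.read()

assert len(data) == 100_000_000, "Text8 should contain exactly 100M characters"
train = data[:90_000_000]
valid = data[90_000_000:95_000_000]
test = data[95_000_000:]
```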