Approximating Real-Time Recurrent Learning with Random Kronecker Factors
Authors: Asier Mujika, Florian Meier, Angelika Steger
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also confirm these theoretical results experimentally. Further, we show empirically that the KF-RTRL algorithm captures long-term dependencies and almost matches the performance of TBPTT on real world tasks by training Recurrent Highway Networks on a synthetic string memorization task and on the Penn Tree Bank task, respectively. |
| Researcher Affiliation | Academia | Asier Mujika, Department of Computer Science, ETH Zürich, Switzerland (asierm@inf.ethz.ch); Florian Meier, Department of Computer Science, ETH Zürich, Switzerland (meierflo@inf.ethz.ch); Angelika Steger, Department of Computer Science, ETH Zürich, Switzerland (steger@inf.ethz.ch) |
| Pseudocode | Yes | The detailed algorithmic steps of KF-RTRL are presented in Algorithm 1 and motivated below. Algorithm 1: One step of KF-RTRL (from time t−1 to t) *(a hedged sketch of one such step follows the table)* |
| Open Source Code | No | The paper does not provide any explicit statement about making the source code available or include a link to a code repository. |
| Open Datasets | Yes | For this experiment we use the Penn Tree Bank [10] dataset, which is a collection of Wall Street Journal articles. |
| Dataset Splits | Yes | We split the data following Mikolov et al. [13]. Figure 2: Validation performance on Penn Tree Bank in bits per character (BPC). Table 1: Results on Penn Tree Bank. Merity et al. [12] is currently the state of the art (trained with TBPTT). For simplicity we do not report standard deviations, as all of them are smaller than 0.03. |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models or types of machines) used for running the experiments. |
| Software Dependencies | No | The paper mentions TensorFlow and the Adam optimizer but does not specify version numbers for these dependencies. |
| Experiment Setup | Yes | We use a RHN with 256 units and a batch size of 256. We optimize the log-likelihood using the Adam optimizer [7] with default Tensorflow [1] parameters, β1 = 0.9 and β2 = 0.999. For each model we pick the optimal learning rate from $\{10^{-2.5}, 10^{-3}, 10^{-3.5}, 10^{-4}\}$. We repeat each experiment 5 times. Apart from that, we reset the hidden state to the all zeros state with probability 0.01 at each time step. *(a configuration sketch follows the table)* |
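
Regarding the Pseudocode row: the quoted Algorithm 1 keeps the RTRL gradient approximation in Kronecker form, $G_{t-1} \approx u_{t-1} \otimes A_{t-1}$, and compresses the exact recursion $G_t = H_t G_{t-1} + z_t \otimes D_t$ back into a single Kronecker product using random signs. Below is a minimal NumPy sketch of one such step, assuming that factorization; the function name `kf_rtrl_step`, the norm-balancing scaling, and the argument shapes are illustrative reconstructions, not the authors' code.

```python
import numpy as np

def kf_rtrl_step(u, A, z, D, H, rng):
    """One KF-RTRL step (sketch).

    u (length m) and A (n x n) give the Kronecker approximation
    G_{t-1} ~ u (x) A of dh_{t-1}/dtheta. H is the hidden-state
    Jacobian dh_t/dh_{t-1}; z (x) D is the immediate Jacobian of the
    new step, assumed to factorize as in the paper's RHN setting.
    """
    # H (u (x) A) = u (x) (H A), so the exact update is
    # u (x) (H A) + z (x) D, a sum of two Kronecker products.
    HA = H @ A
    # Scaling factors that balance the norms of the two factors
    # (a variance-reducing choice; exact formula is an assumption here).
    p1 = np.sqrt(np.linalg.norm(HA) / np.linalg.norm(u))
    p2 = np.sqrt(np.linalg.norm(D) / np.linalg.norm(z))
    # Independent uniform random signs keep the estimate unbiased:
    # E[c1 * c2] = 0 and c1**2 = c2**2 = 1.
    c1, c2 = rng.choice([-1.0, 1.0], size=2)
    u_new = c1 * p1 * u + c2 * p2 * z
    A_new = (c1 / p1) * HA + (c2 / p2) * D
    return u_new, A_new
```

The point of the random signs is that $\mathbb{E}[u' \otimes A'] = u \otimes (HA) + z \otimes D$, i.e. the compressed rank-one estimate matches the exact RTRL update in expectation, which is the unbiasedness property the paper's analysis rests on.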
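
Regarding the Experiment Setup row: since no code was released, the following is a hypothetical Python sketch of the quoted hyperparameters. The variable names and the use of the modern `tf.keras` optimizer API are assumptions (the paper predates TensorFlow 2); only the numeric values come from the quoted text.

```python
import tensorflow as tf

# Reconstructed from the quoted setup; names are illustrative only.
NUM_UNITS = 256           # RHN hidden units
BATCH_SIZE = 256
LEARNING_RATES = [10 ** -2.5, 10 ** -3.0, 10 ** -3.5, 10 ** -4.0]  # sweep
RESET_PROB = 0.01         # per-step probability of zeroing the hidden state
NUM_REPEATS = 5           # each experiment repeated 5 times

# Adam with the quoted (default TensorFlow) moment parameters; the best
# learning rate is picked per model from the sweep above.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATES[1],
    beta_1=0.9,
    beta_2=0.999,
)
```

The per-step state reset with probability 0.01 is a detail worth preserving in any reproduction attempt, since it changes the effective sequence lengths the recurrent model sees during training.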