Approximating Real-Time Recurrent Learning with Random Kronecker Factors

Authors: Asier Mujika, Florian Meier, Angelika Steger

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also confirm these theoretical results experimentally. Further, we show empirically that the KF-RTRL algorithm captures long-term dependencies and almost matches the performance of TBPTT on real world tasks by training Recurrent Highway Networks on a synthetic string memorization task and on the Penn Tree Bank task, respectively.
Researcher Affiliation | Academia | Asier Mujika, Department of Computer Science, ETH Zürich, Switzerland, asierm@inf.ethz.ch; Florian Meier, Department of Computer Science, ETH Zürich, Switzerland, meierflo@inf.ethz.ch; Angelika Steger, Department of Computer Science, ETH Zürich, Switzerland, steger@inf.ethz.ch
Pseudocode | Yes | The detailed algorithmic steps of KF-RTRL are presented in Algorithm 1 and motivated below. Algorithm 1: One step of KF-RTRL (from time t−1 to t). (A hedged sketch of this update step is given after the table.)
Open Source Code | No | The paper does not provide any explicit statement about making the source code available or include a link to a code repository.
Open Datasets | Yes | For this experiment we use the Penn Tree Bank [10] dataset, which is a collection of Wall Street Journal articles.
Dataset Splits | Yes | We split the data following Mikolov et al. [13]. Figure 2: Validation performance on Penn Tree Bank in bits per character (BPC). Table 1: Results on Penn Tree Bank. Merity et al. [12] is currently the state of the art (trained with TBPTT). For simplicity we do not report standard deviations, as all of them are smaller than 0.03. (A short note on the BPC metric follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models or types of machines) used for running the experiments.
Software Dependencies | No | The paper mentions software like 'Tensorflow' and 'Adam optimizer' but does not specify any version numbers for these dependencies.
Experiment Setup | Yes | We use a RHN with 256 units and a batch size of 256. We optimize the log-likelihood using the Adam optimizer [7] with default Tensorflow [1] parameters, β1 = 0.9 and β2 = 0.999. For each model we pick the optimal learning rate from {10^-2.5, 10^-3, 10^-3.5, 10^-4}. We repeat each experiment 5 times. Apart from that, we reset the hidden state to the all zeros state with probability 0.01 at each time step. (A sketch of this setup follows the table.)
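
The full Algorithm 1 is not reproduced on this page. As a rough, hedged sketch of the update it describes, the NumPy snippet below applies the Kronecker-factored approximation to a plain tanh RNN cell (the paper itself uses Recurrent Highway Networks); the function names (kf_rtrl_step, approx_weight_grad), the cell, and the exact choice of the rescaling factors p1, p2 are illustrative assumptions, not taken verbatim from the paper.

```python
# Minimal sketch of one KF-RTRL step for a vanilla tanh RNN,
# h_t = tanh(W @ z_t) with z_t = [h_{t-1}, x_t, 1].  The influence matrix
# dh_t/dW is kept as a single Kronecker product u_t (x) A_t.
import numpy as np

rng = np.random.default_rng(0)

def kf_rtrl_step(W, h_prev, x, u_prev, A_prev):
    n = h_prev.shape[0]
    z = np.concatenate([h_prev, x, np.ones(1)])   # input to the cell
    h = np.tanh(W @ z)                            # new hidden state
    D = np.diag(1.0 - h ** 2)                     # d h_t / d (W z_t)
    H = D @ W[:, :n]                              # d h_t / d h_{t-1}

    # Exact update is  u_prev (x) (H @ A_prev)  +  z (x) D ; compress this sum of
    # two Kronecker products back into one using independent random signs.
    HA = H @ A_prev
    c1, c2 = rng.choice([-1.0, 1.0], size=2)
    p1 = np.sqrt((np.linalg.norm(HA) + 1e-12) / (np.linalg.norm(u_prev) + 1e-12))
    p2 = np.sqrt((np.linalg.norm(D) + 1e-12) / (np.linalg.norm(z) + 1e-12))
    u = c1 * p1 * u_prev + c2 * p2 * z
    A = (c1 / p1) * HA + (c2 / p2) * D
    return h, u, A

def approx_weight_grad(delta_h, u, A):
    # dL/dW ~ (A^T delta_h) u^T, read off from the Kronecker-factored influence matrix.
    return np.outer(A.T @ delta_h, u)
```

Because the signs c1 and c2 are independent, the cross terms cancel in expectation and the single-factor approximation stays unbiased; the rescaling by p1 and p2 is meant to keep the variance of the estimator small.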
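
The Figure 2 and Table 1 numbers quoted in the Dataset Splits row are reported in bits per character (BPC). As a small reminder of the conversion (not code from the paper), BPC is the average negative log-likelihood per character expressed in base 2:

```python
import math

def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per character."""
    return total_nll_nats / (num_chars * math.log(2.0))
```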
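
Finally, the Experiment Setup row can be read as a small hyperparameter sweep. The skeleton below restates those settings in code; the helpers maybe_reset and pick_learning_rate, the placeholder train_fn, and the use of a validation score for model selection are assumptions for illustration, not APIs or details stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters quoted in the Experiment Setup row above.
NUM_UNITS      = 256
BATCH_SIZE     = 256
ADAM_BETAS     = (0.9, 0.999)                                      # beta1, beta2 for Adam
LEARNING_RATES = [10 ** -2.5, 10 ** -3.0, 10 ** -3.5, 10 ** -4.0]  # grid searched per model
NUM_RUNS       = 5                                                  # repetitions per setting
RESET_PROB     = 0.01   # probability of resetting the hidden state at each step

def maybe_reset(h):
    """Reset hidden states to all zeros with probability RESET_PROB
    (applied per example here; the exact granularity is an assumption)."""
    reset = rng.random(h.shape[0]) < RESET_PROB
    h = h.copy()
    h[reset] = 0.0
    return h

def pick_learning_rate(train_fn):
    """Run each learning rate NUM_RUNS times and keep the one with the best
    mean score returned by the placeholder train_fn (e.g. validation BPC)."""
    scores = {lr: np.mean([train_fn(lr) for _ in range(NUM_RUNS)])
              for lr in LEARNING_RATES}
    return min(scores, key=scores.get)
```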