Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning
Authors: Frederik Benzing, Marcelo Matheus Gauy, Asier Mujika, Anders Martinsson, Angelika Steger
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a new approximation algorithm of RTRL, Optimal Kronecker-Sum Approximation (OK). We prove that OK is optimal for a class of approximations of RTRL, which includes all approaches published so far. Additionally, we show that OK has empirically negligible noise: unlike previous algorithms it matches TBPTT in a real-world task (character-level Penn Treebank) and can exploit online parameter updates to outperform TBPTT in a synthetic string memorization task. Code available at GitHub. (A hedged sketch of the core reduction step appears below the table.) |
| Researcher Affiliation | Academia | 1Department of Computer Science, ETH Zurich, Zurich, Switzerland. Correspondence to: FB <benzingf@inf.ethz.ch>, MMG <marcelo.matheus@inf.ethz.ch>. |
| Pseudocode | Yes | Algorithm 1: One step of unbiasedly approximating RTRL. Algorithm 2: The OK approximation. Algorithm 3: Opt(C). (A simplified sketch of the Algorithm 1 structure appears below the table.) |
| Open Source Code | Yes | Code available at GitHub. |
| Open Datasets | Yes | The second, character-level language modeling on the Penn Treebank dataset (CHAR-PTB), is a complex real-world task commonly used to assess the capabilities of RNNs. |
| Dataset Splits | Yes | We split the data following (Mikolov et al., 2012). |
| Hardware Specification | No | The paper mentions network sizes (e.g., 'RHN with 128 units') but does not specify any hardware details such as CPU/GPU models, memory, or cloud computing instances used for experiments. |
| Software Dependencies | No | We optimize the log-likelihood using the Adam optimizer (Kingma & Ba, 2015) with default TensorFlow (Abadi et al., 2016) parameters, β₁ = 0.9 and β₂ = 0.999. |
| Experiment Setup | Yes | We use curriculum learning and start with T = 1, increasing T by one when the RNN error drops below 0.15 bits/char. After each sequence, the hidden states are reset to zero. To improve performance, the length of the sequence is sampled uniformly at random from T-5 to T. We use an RHN with 128 units and a batch size of 16. We optimize the log-likelihood using the Adam optimizer (Kingma & Ba, 2015) with default TensorFlow (Abadi et al., 2016) parameters, β₁ = 0.9 and β₂ = 0.999. For each model, we pick the best learning rate from {10^-2.5, 10^-3, 10^-3.5, 10^-4}. We repeat each experiment 5 times. (A sketch of this curriculum schedule appears below the table.) |
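The core computational step of OK is reducing a sum of r+1 Kronecker products back to r terms, optimally in the Frobenius norm. By the classic rearrangement of Van Loan and Pitsianis, sum_i a_i ⊗ B_i rearranges into the low-rank matrix sum_i a_i vec(B_i)^T, so the optimal reduction amounts to a truncated SVD of a small core matrix. Below is a minimal deterministic sketch of that reduction (function and variable names are ours); the paper's actual algorithm additionally applies a random rotation so that the reduced sum stays an unbiased estimate of RTRL.

```python
import numpy as np

def ok_reduce(a_list, B_list, r):
    """Reduce sum_i a_i (x) B_i from k summands to the best r summands.

    Rearranging the sum as R = sum_i a_i vec(B_i)^T turns the problem
    into a best rank-r approximation of R, solved by a small SVD.
    """
    A = np.stack(a_list, axis=1)                         # (p, k)
    B = np.stack([Bi.ravel() for Bi in B_list], axis=1)  # (q*s, k)
    # Orthogonalize both factor stacks so the SVD runs on a k x k core
    # instead of the full (p) x (q*s) rearranged matrix R = A @ B.T.
    Qa, Ra = np.linalg.qr(A)
    Qb, Rb = np.linalg.qr(B)
    U, S, Vt = np.linalg.svd(Ra @ Rb.T)
    # Keep the top-r singular triplets as the new Kronecker factors.
    new_a = [Qa @ (S[i] * U[:, i]) for i in range(r)]
    new_B = [(Qb @ Vt[i]).reshape(B_list[0].shape) for i in range(r)]
    return new_a, new_B
```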
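With that reduction in hand, the one-step structure of Algorithm 1 is: propagate the current factors through the hidden-to-hidden Jacobian, which touches only one Kronecker factor since (I ⊗ H)(a ⊗ B) = a ⊗ (HB); append the one-term Kronecker factorization of the immediate Jacobian; and reduce back to r terms. A simplified sketch under those conventions (the shape conventions and names are our assumptions, and the unbiasing rotation is again omitted):

```python
def rtrl_ok_step(a_list, B_list, H_t, a_t, D_t, r):
    """One step of Kronecker-sum RTRL in the shape of Algorithm 1.

    The influence matrix dh_t/dtheta is kept as G ≈ sum_i a_i (x) B_i.
    Its RTRL recurrence G_t = (I (x) H_t) G_{t-1} + (immediate Jacobian)
    maps each summand to a_i (x) (H_t B_i) and appends one new summand
    (a_t, D_t), growing the sum to r+1 terms before reducing to r.
    """
    propagated = [H_t @ B for B in B_list]
    return ok_reduce(a_list + [a_t], propagated + [D_t], r)
```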
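The quoted experiment setup combines a curriculum on the sequence length T with a small learning-rate grid. The following is a hedged sketch of that schedule: the threshold (0.15 bits/char), the sampling range [T-5, T], and the grid come from the quote, while `train_on_sequence` and `num_updates` are illustrative placeholders.

```python
import numpy as np

# Learning-rate grid from the quoted sweep.
LEARNING_RATES = [10 ** e for e in (-2.5, -3.0, -3.5, -4.0)]

def run_curriculum(train_on_sequence, num_updates, threshold=0.15):
    """Advance T by one whenever the error drops below the threshold;
    sequence lengths are sampled uniformly at random from T-5 to T."""
    T = 1
    for _ in range(num_updates):
        seq_len = np.random.randint(max(1, T - 5), T + 1)
        bits_per_char = train_on_sequence(seq_len)  # placeholder trainer
        if bits_per_char < threshold:
            T += 1
    return T
```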