A Clockwork RNN
Authors: Jan Koutník, Klaus Greff, Faustino Gomez, Jürgen Schmidhuber
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The network is demonstrated in preliminary experiments involving three tasks: audio signal generation, TIMIT spoken word classification, where it outperforms both SRN and LSTM networks, and online handwriting recognition, where it outperforms SRNs. |
| Researcher Affiliation | Academia | Jan Koutník KOU@IDSIA.CH Klaus Greff KLAUS@IDSIA.CH Faustino Gomez TINO@IDSIA.CH Jürgen Schmidhuber JUERGEN@IDSIA.CH IDSIA, USI&SUPSI, Manno-Lugano, CH-6928, Switzerland |
| Pseudocode | No | The paper describes the CW-RNN architecture and its calculations using equations and descriptive text, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about making its source code available or provide a link to a code repository. |
| Open Datasets | Yes | Each sequence contains an audio signal of one spoken word from the TIMIT Speech Recognition Benchmark (Garofolo et al., 1993). The dataset (Liwicki & Bunke, 2005) consists of 5364 hand-written lines of text in the training set and 3859 lines in the test set, and two validation sets that were combined to form one validation set of size 2956. |
| Dataset Splits | Yes | The dataset contains 25 different words (classes) arranged in 5 groups based on their phonetic suffix. For every word there are 7 examples from different speakers, which were partitioned into 5 for training and 2 for testing, for a total of 175 sequences (125 train, 50 test). The dataset (Liwicki & Bunke, 2005) consists of 5364 hand-written lines of text in the training set and 3859 lines in the test set, and two validation sets that were combined to form one validation set of size 2956. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | No | The paper mentions methods like Stochastic Gradient Descent (SGD) with Nesterov-style momentum, tanh activation function, and connectionist temporal classification (CTC), but it does not specify any software libraries or their version numbers (e.g., TensorFlow, PyTorch, scikit-learn versions). |
| Experiment Setup | Yes | Initial values for all the weights were drawn from a Gaussian distribution with zero mean and standard deviation of 0.1. Initial values of all internal state variables for all hidden activations were set to 0. Each setup was run 100 times with different random initialization of parameters. All networks were trained using Stochastic Gradient Descent (SGD) with Nesterov-style momentum (Sutskever et al., 2013). All networks used the same architecture: no inputs, one hidden layer and a single linear output neuron. Each network type was run with 4 different sizes: 100, 250, 500, and 1000 parameters. The networks were trained for 2000 epochs to minimize the mean squared error. Momentum was set to 0.95, with a learning rate that was optimized separately for each method, but kept the same for all network sizes: 3.0 × 10⁻⁴ for SRN and CW-RNN, and 3.0 × 10⁻⁵ for LSTM. For LSTM it was also crucial to initialize the bias of the forget gates to a high value (5.0 in this case). The hidden units of CW-RNN were divided into nine approximately equally sized groups with exponential clock-timings {1, 2, 4, ..., 256}. |
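
Since the paper describes the CW-RNN update only through equations and prose (no pseudocode, per the Pseudocode row), the following is a minimal NumPy sketch of one clockwork update step under the setup quoted in the Experiment Setup row. The function name `cw_rnn_step`, the module size `block`, and the use of an input vector are illustrative assumptions, not code from the authors; only the nine modules, the exponential clock periods {1, 2, 4, ..., 256}, the N(0, 0.1) weight initialization, and the zero-initialized state are taken from the paper.

```python
import numpy as np

# Sketch of one CW-RNN forward step (names and sizes are illustrative assumptions).
# Hidden units are split into modules; module i has clock period T_i and is updated
# only at time steps t where t % T_i == 0. With T_i = 2**i this reproduces the
# exponential clock-timings {1, 2, 4, ..., 256} reported in the experiments.

def cw_rnn_step(x_t, h_prev, W_H, W_I, b_H, periods, block, t):
    """One forward step of a clockwork RNN.

    x_t     : (n_in,)        input at time t
    h_prev  : (n_hid,)       previous hidden state
    W_H     : (n_hid, n_hid) recurrent weights; rows of module i connect only to
                             modules with the same or larger period
    W_I     : (n_hid, n_in)  input weights
    b_H     : (n_hid,)       hidden bias
    periods : clock periods, one per module (e.g. [1, 2, 4, ..., 256])
    block   : hidden units per module (n_hid // len(periods))
    t       : current time step
    """
    h_new = h_prev.copy()
    for i, T_i in enumerate(periods):
        if t % T_i != 0:
            continue                      # inactive module keeps its previous state
        rows = slice(i * block, (i + 1) * block)
        h_new[rows] = np.tanh(W_H[rows] @ h_prev + W_I[rows] @ x_t + b_H[rows])
    return h_new


# Illustrative usage with nine modules and exponential periods; the module size and
# input dimension here are arbitrary, not the paper's exact parameter counts.
periods = [2 ** i for i in range(9)]      # {1, 2, 4, ..., 256}
block, n_in = 4, 3
n_hid = block * len(periods)

rng = np.random.default_rng(0)
W_H = rng.normal(0.0, 0.1, (n_hid, n_hid))   # weights drawn from N(0, 0.1), as reported
W_I = rng.normal(0.0, 0.1, (n_hid, n_in))
b_H = np.zeros(n_hid)

# Enforce the block-upper-triangular recurrent structure: module i receives
# connections only from modules j >= i (equal or slower clocks).
for i in range(len(periods)):
    W_H[i * block:(i + 1) * block, : i * block] = 0.0

h = np.zeros(n_hid)                       # internal state initialized to 0, as reported
x = rng.normal(size=n_in)
for t in range(1, 11):
    h = cw_rnn_step(x, h, W_H, W_I, b_H, periods, block, t)
```

Note that the audio-generation experiments quoted above use no inputs at all; the input term is kept in the sketch only to show the general form of the update.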