Preventing Gradient Explosions in Gated Recurrent Units
Authors: Sekitoshi Kanai, Yasuhiro Fujiwara, Sotetsu Iwamura
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our method in experiments on language modeling and polyphonic music modeling. Our experiments showed that our method can prevent the exploding gradient problem and improve modeling accuracy. |
| Researcher Affiliation | Industry | Sekitoshi Kanai, Yasuhiro Fujiwara, Sotetsu Iwamura NTT Software Innovation Center 3-9-11, Midori-cho, Musashino-shi, Tokyo {kanai.sekitoshi, fujiwara.yasuhiro, iwamura.sotetsu}@lab.ntt.co.jp |
| Pseudocode | Yes | We compute $P_\delta(\cdot)$ by using the following procedure: Step 1. Decompose $\hat{W}_{hh}^{(\tau)} := W_{hh}^{(\tau-1)} - \eta \nabla_{W_{hh}} C_{D_\tau}(\theta)$ by using singular value decomposition (SVD): $\hat{W}_{hh}^{(\tau)} = U \Sigma V$ (12). Step 2. Replace the singular values that are greater than the threshold $2-\delta$: $\Sigma' = \mathrm{diag}(\min(\sigma_1, 2-\delta), \dots, \min(\sigma_n, 2-\delta))$ (13). Step 3. Reconstruct $W_{hh}^{(\tau)}$ by using $U$, $V$, and $\Sigma'$ from Steps 1 and 2: $W_{hh}^{(\tau)} \leftarrow U \Sigma' V$ (14). (A NumPy sketch of this projection follows the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it explicitly state that code is released or available in supplementary materials. |
| Open Datasets | Yes | Penn Treebank (PTB) [25] is a widely used dataset to evaluate the performance of RNNs. ... We used the Nottingham dataset: a MIDI file containing 1200 folk tunes [6]. |
| Dataset Splits | Yes | Penn Treebank (PTB) ... is split into training, validation, and test sets, and the sets are composed of 930k, 74k, and 80k tokens. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions computation time but not the hardware it was run on. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Our model architecture was as follows: The first layer was a 650 × 10,000 linear layer without bias to convert the one-hot vector input into a dense vector, and we multiplied the output of the first layer by 0.01 because our method assumes small inputs. The second layer was a GRU layer with 650 units, and we used the softmax function as the output layer. We applied 50% dropout to the output of each layer except for the recurrent connection [38]. We unfolded the GRU for 35 time steps in BPTT and set the mini-batch size to 20. We trained the GRU with SGD for 75 epochs... We set the learning rate to one in the first 10 epochs, and then divided the learning rate by 1.1 after each epoch. (A PyTorch sketch of this setup follows the table.) |
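
The procedure quoted in the Pseudocode row projects the SGD-updated recurrent weight matrix back onto the set of matrices whose spectral norm is at most $2-\delta$. The following is a minimal NumPy sketch of that projection under the stated steps; the function name `clip_singular_values` and the `delta` default are illustrative choices, not taken from the paper.

```python
import numpy as np

def clip_singular_values(W_hat, delta=0.01):
    """Project a candidate recurrent weight matrix onto the set of matrices
    whose spectral norm is at most 2 - delta (Steps 1-3 quoted above).

    W_hat is the matrix after the ordinary SGD update, i.e.
    W_hh^(tau-1) - eta * grad; the default delta is an assumed value.
    """
    # Step 1: singular value decomposition of the updated matrix.
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)
    # Step 2: clip singular values that exceed the threshold 2 - delta.
    s_clipped = np.minimum(s, 2.0 - delta)
    # Step 3: reconstruct the constrained recurrent weight matrix.
    return (U * s_clipped) @ Vt
```

In this reading, the projection runs after each mini-batch update of the recurrent weights, and the reconstructed matrix replaces $W_{hh}$ before the next forward pass.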
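
The Experiment Setup row describes the PTB language-modeling architecture. The sketch below assembles the quoted pieces (650 × 10,000 input layer without bias, 0.01 input scaling, a 650-unit GRU, 50% dropout, softmax output) in PyTorch; the class name, module layout, and every detail not in the quote are assumptions for illustration, not the authors' code.

```python
import torch.nn as nn

class GRULanguageModel(nn.Module):
    """Rough sketch of the PTB setup quoted above (layer sizes and the 0.01
    input scaling follow the paper; everything else is assumed)."""

    def __init__(self, vocab_size=10_000, hidden_size=650, dropout=0.5):
        super().__init__()
        # 650 x 10,000 linear layer without bias: one-hot -> dense vector.
        self.embed = nn.Linear(vocab_size, hidden_size, bias=False)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.drop = nn.Dropout(dropout)
        # Output layer; the softmax is applied by the loss at training time.
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, one_hot_inputs, hidden=None):
        # Scale the first layer's output by 0.01 (the method assumes small inputs),
        # then apply 50% dropout to the non-recurrent outputs.
        x = self.drop(0.01 * self.embed(one_hot_inputs))
        out, hidden = self.gru(x, hidden)
        return self.decoder(self.drop(out)), hidden
```

Per the quoted setup, training would use BPTT truncated to 35 time steps, mini-batches of 20, and SGD for 75 epochs with a learning rate of 1 for the first 10 epochs, divided by 1.1 after each subsequent epoch.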