Preventing Gradient Explosions in Gated Recurrent Units
Authors: Sekitoshi Kanai, Yasuhiro Fujiwara, Sotetsu Iwamura
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our method in experiments on language modeling and polyphonic music modeling. Our experiments showed that our method can prevent the exploding gradient problem and improve modeling accuracy. |
| Researcher Affiliation | Industry | Sekitoshi Kanai, Yasuhiro Fujiwara, Sotetsu Iwamura NTT Software Innovation Center 3-9-11, Midori-cho, Musashino-shi, Tokyo {kanai.sekitoshi, fujiwara.yasuhiro, iwamura.sotetsu}@lab.ntt.co.jp |
| Pseudocode | Yes | We compute $P_\delta(\cdot)$ by using the following procedure: Step 1. Decompose $\hat{W}_{hh}^{(\tau)} := W_{hh}^{(\tau-1)} - \eta \nabla_{W_{hh}} C_{D_\tau}(\theta)$ by using singular value decomposition (SVD): $\hat{W}_{hh}^{(\tau)} = U \Sigma V$ (12). Step 2. Replace the singular values that are greater than the threshold $2-\delta$: $\Sigma' = \mathrm{diag}(\min(\sigma_1, 2-\delta), \dots, \min(\sigma_n, 2-\delta))$ (13). Step 3. Reconstruct $W_{hh}^{(\tau)}$ by using $U$, $V$, and $\Sigma'$ from Steps 1 and 2: $W_{hh}^{(\tau)} \leftarrow U \Sigma' V$ (14). (A NumPy sketch of this projection follows the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it explicitly state that code is released or available in supplementary materials. |
| Open Datasets | Yes | Penn Treebank (PTB) [25] is a widely used dataset to evaluate the performance of RNNs. ... We used the Nottingham dataset: a MIDI file containing 1200 folk tunes [6]. |
| Dataset Splits | Yes | Penn Treebank (PTB) ... is split into training, validation, and test sets, and the sets are composed of 930k, 74k, and 80k tokens. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions computation time but not the hardware it was run on. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Our model architecture was as follows: The first layer was a 650 × 10,000 linear layer without bias to convert the one-hot vector input into a dense vector, and we multiplied the output of the first layer by 0.01 because our method assumes small inputs. The second layer was a GRU layer with 650 units, and we used the softmax function as the output layer. We applied 50% dropout to the output of each layer except for the recurrent connection [38]. We unfolded the GRU for 35 time steps in BPTT and set the mini-batch size to 20. We trained the GRU with SGD for 75 epochs... We set the learning rate to one in the first 10 epochs, and then divided the learning rate by 1.1 after each epoch. (A PyTorch sketch of this setup follows the table.) |
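
The procedure quoted in the Pseudocode row projects the SGD-updated recurrent weight matrix back onto the set of matrices whose spectral norm is at most $2-\delta$. The following is a minimal NumPy sketch of that projection under the stated steps; the function name `clip_singular_values` and the `delta` default are illustrative choices, not taken from the paper.

```python
import numpy as np

def clip_singular_values(W_hat, delta=0.01):
    """Project a candidate recurrent weight matrix onto the set of matrices
    whose spectral norm is at most 2 - delta (Steps 1-3 quoted above).

    W_hat is the matrix after the ordinary SGD update, i.e.
    W_hh^(tau-1) - eta * grad; the default delta is an assumed value.
    """
    # Step 1: singular value decomposition of the updated matrix.
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)
    # Step 2: clip singular values that exceed the threshold 2 - delta.
    s_clipped = np.minimum(s, 2.0 - delta)
    # Step 3: reconstruct the constrained recurrent weight matrix.
    return (U * s_clipped) @ Vt
```

In this reading, the projection runs after each mini-batch update of the recurrent weights, and the reconstructed matrix replaces $W_{hh}$ before the next forward pass.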
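
The Experiment Setup row describes the PTB language-modeling architecture. The sketch below assembles the quoted pieces (650 × 10,000 input layer without bias, 0.01 input scaling, a 650-unit GRU, 50% dropout, softmax output) in PyTorch; the class name, module layout, and every detail not in the quote are assumptions for illustration, not the authors' code.

```python
import torch.nn as nn

class GRULanguageModel(nn.Module):
    """Rough sketch of the PTB setup quoted above (layer sizes and the 0.01
    input scaling follow the paper; everything else is assumed)."""

    def __init__(self, vocab_size=10_000, hidden_size=650, dropout=0.5):
        super().__init__()
        # 650 x 10,000 linear layer without bias: one-hot -> dense vector.
        self.embed = nn.Linear(vocab_size, hidden_size, bias=False)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.drop = nn.Dropout(dropout)
        # Output layer; the softmax is applied by the loss at training time.
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, one_hot_inputs, hidden=None):
        # Scale the first layer's output by 0.01 (the method assumes small inputs),
        # then apply 50% dropout to the non-recurrent outputs.
        x = self.drop(0.01 * self.embed(one_hot_inputs))
        out, hidden = self.gru(x, hidden)
        return self.decoder(self.drop(out)), hidden
```

Per the quoted setup, training would use BPTT truncated to 35 time steps, mini-batches of 20, and SGD for 75 epochs with a learning rate of 1 for the first 10 epochs, divided by 1.1 after each subsequent epoch.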