Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization
Authors: Jiong Zhang, Qi Lei, Inderjit Dhillon
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental results also demonstrate that the proposed framework converges faster, and has good generalization, especially in capturing long range dependencies, as shown on the synthetic addition and copy tasks, as well as on the MNIST and Penn Tree Bank data sets. |
| Researcher Affiliation | Collaboration | ¹University of Texas at Austin; ²Amazon.com. |
| Pseudocode | Yes | Algorithms 2 and 3 (see Appendix B) |
| Open Source Code | Yes | The source code is available at https://github.com/zhangjiong724/spectral-RNN |
| Open Datasets | Yes | The 28 × 28 MNIST pixels are flattened into a vector and then traversed by the RNN models. Table 2 shows test accuracy across multiple models. Spectral-RNN reaches the highest 97.7% accuracy on pixel-MNIST with only 128 hidden dimensions and 6k parameters. We tested different models on the Penn Tree Bank (PTB) (Marcus et al., 1993) dataset for word-level prediction tasks. |
| Dataset Splits | Yes | The dataset was split into a training set of 60000 instances and a test set of 10000 instances. The dataset contains 929k training words, 73k validation words, and 82k test words with 10k vocabulary. |
| Hardware Specification | No | The paper mentions "exploit GPU computing power" but does not specify any particular GPU models, CPU models, or other hardware specifications like memory or specific cloud/cluster configurations used for the experiments. |
| Software Dependencies | No | These models are implemented with Tensorflow (Abadi et al., 2015). We applied Adam optimizer with stochastic gradient descent (Kingma & Ba, 2014). The paper mentions TensorFlow and Adam optimizer but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We performed a grid search over several learning rates ρ = {0.1, 0.01, 0.001, 0.0001}, decay rate α = {0.9, 0.8, 0.5} and batch size B = {64, 128, 256, 512}. The reported results are the best one among them. We use initial learning rate of 0.1 and decay by factor of 0.8 at each epoch, and 80% dropout is applied on 2-layered models. |
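As a reading aid only, the sketch below illustrates the hyper-parameter grid search reported in the Experiment Setup row (learning rate ρ, decay rate α, batch size B, best result kept). It is not taken from the released code; `train_and_evaluate` is a hypothetical placeholder for training the Spectral-RNN model from the repository linked above.

```python
# Minimal sketch of the grid search described in the paper's experiment setup.
import itertools

learning_rates = [0.1, 0.01, 0.001, 0.0001]  # rho
decay_rates = [0.9, 0.8, 0.5]                # alpha (per-epoch LR decay factor)
batch_sizes = [64, 128, 256, 512]            # B


def train_and_evaluate(lr, decay, batch_size):
    """Hypothetical placeholder: train the model with the given settings and
    return its test accuracy. Replace with a call into the released code."""
    return 0.0


best = None
for lr, decay, batch in itertools.product(learning_rates, decay_rates, batch_sizes):
    acc = train_and_evaluate(lr, decay, batch)
    if best is None or acc > best[0]:
        best = (acc, lr, decay, batch)

# The paper reports the best result over the grid; this mirrors that selection.
print("best configuration (accuracy, lr, decay, batch):", best)
```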