Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization

Authors: Jiong Zhang, Qi Lei, Inderjit Dhillon

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental results also demonstrate that the proposed framework converges faster, and has good generalization, especially in capturing long range dependencies, as shown on the synthetic addition and copy tasks, as well as on the MNIST and Penn Tree Bank data sets.
Researcher Affiliation | Collaboration | ¹University of Texas at Austin, ²Amazon.com
Pseudocode | Yes | Algorithms 2 and 3 (see Appendix B); a rough illustrative sketch of the SVD-style parameterization follows this table.
Open Source Code | Yes | The source code is available at https://github.com/zhangjiong724/spectral-RNN
Open Datasets | Yes | The 28×28 MNIST pixels are flattened into a vector and then traversed by the RNN models (see the pixel-sequence sketch after this table). Table 2 shows test accuracy across multiple models. Spectral-RNN reaches the highest accuracy of 97.7% on pixel-MNIST with only 128 hidden dimensions and 6k parameters. We tested different models on the Penn Tree Bank (PTB) dataset (Marcus et al., 1993) for word-level prediction tasks.
Dataset Splits | Yes | The dataset was split into a training set of 60000 instances and a test set of 10000 instances. The dataset contains 929k training words, 73k validation words, and 82k test words with a 10k vocabulary.
Hardware Specification | No | The paper mentions "exploit GPU computing power" but does not specify any particular GPU or CPU models, nor other hardware details such as memory or the cloud/cluster configuration used for the experiments.
Software Dependencies | No | These models are implemented with TensorFlow (Abadi et al., 2015). We applied the Adam optimizer with stochastic gradient descent (Kingma & Ba, 2014). The paper names TensorFlow and the Adam optimizer but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | We performed a grid search over learning rates ρ = {0.1, 0.01, 0.001, 0.0001}, decay rates α = {0.9, 0.8, 0.5} and batch sizes B = {64, 128, 256, 512}. The reported results are the best among them. We use an initial learning rate of 0.1, decayed by a factor of 0.8 at each epoch, and 80% dropout is applied on the 2-layered models (see the grid-search sketch after this table).
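
The paper's Algorithms 2 and 3 are not reproduced in this report. As a rough, assumed illustration of the general idea named in the title (not the authors' actual pseudocode), the NumPy sketch below parameterizes a matrix as W = UΣVᵀ with orthogonal factors built as products of Householder reflections, so its singular values are set explicitly; the helper names `householder_product` and `svd_parameterized_matrix` are hypothetical and not taken from the released code.

```python
# Minimal NumPy sketch (an assumption about the general idea, not the paper's
# Algorithms 2 and 3): parameterize W = U @ diag(sigma) @ V.T, where U and V
# are orthogonal matrices formed as products of Householder reflections, so
# the singular values of W are exactly the entries of `sigma`.
import numpy as np

def householder_product(vectors):
    """Accumulate reflections H_k = I - 2 u_k u_k^T (unit u_k) into one orthogonal matrix."""
    n = vectors.shape[1]
    Q = np.eye(n)
    for u in vectors:
        u = u / np.linalg.norm(u)
        Q = Q - 2.0 * np.outer(u, u @ Q)   # apply H_k to Q without forming H_k
    return Q

def svd_parameterized_matrix(u_vecs, v_vecs, sigma):
    """Assemble W = U diag(sigma) V^T from reflection vectors and singular values."""
    U = householder_product(u_vecs)
    V = householder_product(v_vecs)
    return U @ np.diag(sigma) @ V.T

rng = np.random.default_rng(0)
n = 8                                        # illustrative hidden dimension
W = svd_parameterized_matrix(rng.normal(size=(n, n)),
                             rng.normal(size=(n, n)),
                             np.ones(n))     # pin all singular values at 1
print(np.allclose(np.linalg.svd(W, compute_uv=False), 1.0))  # -> True
```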
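
For the pixel-MNIST setup quoted in the Open Datasets row, a minimal sketch of the flattening step is given below. It assumes images arrive as a (batch, 28, 28) array from any MNIST loader; the helper name `to_pixel_sequences` is hypothetical.

```python
# Minimal sketch of the pixel-MNIST input pipeline described above: each
# 28 x 28 image becomes a length-784 sequence read one pixel per time step.
# `to_pixel_sequences` is a hypothetical helper, not from the released code.
import numpy as np

def to_pixel_sequences(images):
    """(batch, 28, 28) uint8 images -> (batch, 784, 1) float32 sequences in [0, 1]."""
    batch = images.shape[0]
    return images.reshape(batch, 28 * 28, 1).astype(np.float32) / 255.0

images = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)  # stand-in batch
sequences = to_pixel_sequences(images)
print(sequences.shape)  # (4, 784, 1): 784 time steps per example for the RNN
```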
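
The grid search described in the Experiment Setup row can be read as the sketch below. `train_and_eval` is a hypothetical stand-in for training one Spectral-RNN configuration (e.g. with TensorFlow's Adam optimizer) and returning its validation accuracy; a dummy score keeps the sketch self-contained.

```python
# Minimal sketch of the reported hyperparameter grid search. `train_and_eval`
# is a hypothetical placeholder for training one configuration (e.g. with
# tf.keras.optimizers.Adam) and returning validation accuracy.
import itertools
import random

def train_and_eval(lr, decay, batch_size):
    """Placeholder: train with learning rate `lr`, decay it by `decay` each
    epoch, use batch size `batch_size`, and return validation accuracy."""
    return random.random()   # dummy score so the loop runs end to end

grid = itertools.product(
    [0.1, 0.01, 0.001, 0.0001],   # learning rates rho
    [0.9, 0.8, 0.5],              # per-epoch decay factors alpha
    [64, 128, 256, 512],          # batch sizes B
)
best_acc, best_cfg = max((train_and_eval(*cfg), cfg) for cfg in grid)
print(best_cfg, best_acc)         # only the best configuration is reported
```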