Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization
Authors: Jiong Zhang, Qi Lei, Inderjit Dhillon
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental results also demonstrate that the proposed framework converges faster, and has good generalization, especially in capturing long range dependencies, as shown on the synthetic addition and copy tasks, as well as on the MNIST and Penn Tree Bank data sets. |
| Researcher Affiliation | Collaboration | ¹University of Texas at Austin; ²Amazon.com. |
| Pseudocode | Yes | Algorithms 2 and 3 (see Appendix B) |
| Open Source Code | Yes | The source code is available at https://github.com/zhangjiong724/spectral-RNN |
| Open Datasets | Yes | The 28 × 28 MNIST pixels are flattened into a vector and then traversed by the RNN models. Table 2 shows test accuracy across multiple models. Spectral-RNN reaches the highest 97.7% accuracy on pixel-MNIST with only 128 hidden dimensions and 6k parameters. We tested different models on the Penn Tree Bank (PTB) (Marcus et al., 1993) dataset for word-level prediction tasks. |
| Dataset Splits | Yes | The dataset was split into a training set of 60000 instances and a test set of 10000 instances. The dataset contains 929k training words, 73k validation words, and 82k test words with 10k vocabulary. |
| Hardware Specification | No | The paper mentions "exploit GPU computing power" but does not specify any particular GPU models, CPU models, or other hardware specifications like memory or specific cloud/cluster configurations used for the experiments. |
| Software Dependencies | No | These models are implemented with Tensorflow (Abadi et al., 2015). We applied Adam optimizer with stochastic gradient descent (Kingma & Ba, 2014). The paper mentions TensorFlow and Adam optimizer but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We performed a grid search over several learning rates ρ = {0.1, 0.01, 0.001, 0.0001}, decay rate α = {0.9, 0.8, 0.5} and batch size B = {64, 128, 256, 512}. The reported results are the best one among them. We use initial learning rate of 0.1 and decay by factor of 0.8 at each epoch, and 80% dropout is applied on 2-layered models. |
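As a reading aid only, the sketch below illustrates the hyper-parameter grid search reported in the Experiment Setup row (learning rate ρ, decay rate α, batch size B, best result kept). It is not taken from the released code; `train_and_evaluate` is a hypothetical placeholder for training the Spectral-RNN model from the repository linked above.

```python
# Minimal sketch of the grid search described in the paper's experiment setup.
import itertools

learning_rates = [0.1, 0.01, 0.001, 0.0001]  # rho
decay_rates = [0.9, 0.8, 0.5]                # alpha (per-epoch LR decay factor)
batch_sizes = [64, 128, 256, 512]            # B


def train_and_evaluate(lr, decay, batch_size):
    """Hypothetical placeholder: train the model with the given settings and
    return its test accuracy. Replace with a call into the released code."""
    return 0.0


best = None
for lr, decay, batch in itertools.product(learning_rates, decay_rates, batch_sizes):
    acc = train_and_evaluate(lr, decay, batch)
    if best is None or acc > best[0]:
        best = (acc, lr, decay, batch)

# The paper reports the best result over the grid; this mirrors that selection.
print("best configuration (accuracy, lr, decay, batch):", best)
```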