Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations
Authors: Behnam Neyshabur, Yuhuai Wu, Russ R. Salakhutdinov, Nati Srebro
NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes. [...] We compare the performance of SGD vs. Path-SGD with/without κ(2). [...] We evaluate Path-SGD on the Sequential MNIST problem. Table 2, right column, reports test error rates achieved by RNN-Path compared to the previously published results. |
| Researcher Affiliation | Academia | Behnam Neyshabur, Toyota Technological Institute at Chicago (bneyshabur@ttic.edu); Yuhuai Wu, University of Toronto (ywu@cs.toronto.edu); Ruslan Salakhutdinov, Carnegie Mellon University (rsalakhu@cs.cmu.edu); Nathan Srebro, Toyota Technological Institute at Chicago (nati@ttic.edu) |
| Pseudocode | No | The paper provides mathematical derivations and descriptions of algorithms, but it does not include a dedicated section or figure labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | No | The paper thanks Saizheng Zhang for sharing a base code for RNNs, but it does not state that the code developed for this paper is open-source or provide a link. |
| Open Datasets | Yes | We train a single-layer RNN with H = 200 hidden units for the task of word-level language modeling on Penn Treebank (PTB) Corpus [13]. [...] For both tasks, we closely follow the experimental protocol in [12]. We train a single-layer RNN consisting of 100 hidden units with path-SGD, referred to as RNN-Path. [...] Next, we evaluate Path-SGD on the Sequential MNIST problem. |
| Dataset Splits | Yes | We use the standard split (929k training, 73k validation and 82k test) and the vocabulary size of 10k words. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Adam optimizer [8]" and "most deep learning libraries", but it does not specify any software versions for these or other dependencies. |
| Experiment Setup | Yes | We initialize the weights by sampling from the uniform distribution with range [-0.1, 0.1]. The plots compare the training and test errors using a mini-batch of size 32 and backpropagating through T = 20 time steps, where the step-size is chosen by a grid search. [...] We performed grid search for the learning rates over {10^-2, 10^-3, 10^-4} for both our model and the baseline. Non-recurrent weights were initialized from the uniform distribution with range [-0.01, 0.01]. [...] For ReLU RNNs, we initialize the recurrent matrices from uniform[-0.01, 0.01], and uniform[-0.2, 0.2] for non-recurrent weights. For LSTMs, we use orthogonal initialization [21] for the recurrent matrices and uniform[-0.01, 0.01] for non-recurrent weights. |
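
The initialization and grid-search protocol quoted in the Experiment Setup row can be summarized in code. The sketch below is a hypothetical reconstruction in PyTorch, not the authors' implementation (which is not released, per the Open Source Code row): the helper names `init_relu_rnn` and `init_lstm`, the zero bias initialization, and the toy `input_size` are assumptions; only the uniform/orthogonal ranges, the {10^-2, 10^-3, 10^-4} learning-rate grid, the mini-batch size of 32, and the T = 20 BPTT truncation come from the quoted setup.

```python
# Hypothetical sketch of the quoted initialization and grid-search protocol,
# written with PyTorch; names and any value not quoted above are assumptions.
import torch
import torch.nn as nn

def init_relu_rnn(rnn: nn.RNN) -> None:
    """Uniform[-0.01, 0.01] for recurrent matrices, uniform[-0.2, 0.2]
    for non-recurrent (input-to-hidden) weights, as quoted above."""
    for name, param in rnn.named_parameters():
        if "weight_hh" in name:        # recurrent weights
            nn.init.uniform_(param, -0.01, 0.01)
        elif "weight_ih" in name:      # non-recurrent weights
            nn.init.uniform_(param, -0.2, 0.2)
        else:                          # biases (zero init is an assumption)
            nn.init.zeros_(param)

def init_lstm(lstm: nn.LSTM) -> None:
    """Orthogonal recurrent matrices, uniform[-0.01, 0.01] non-recurrent
    weights, following the quoted LSTM setup."""
    for name, param in lstm.named_parameters():
        if "weight_hh" in name:
            nn.init.orthogonal_(param)
        elif "weight_ih" in name:
            nn.init.uniform_(param, -0.01, 0.01)
        else:
            nn.init.zeros_(param)

# Learning-rate grid search over {1e-2, 1e-3, 1e-4}, as in the quoted setup.
for lr in (1e-2, 1e-3, 1e-4):
    rnn = nn.RNN(input_size=1, hidden_size=100,
                 nonlinearity="relu", batch_first=True)
    init_relu_rnn(rnn)
    optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)
    # ... train with mini-batches of size 32, truncating BPTT at T = 20 steps ...
```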