On the Convergence Rate of Training Recurrent Neural Networks
Authors: Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in L, SGD is capable of minimizing the regression loss at a linear convergence rate (a schematic form of this claim is sketched after the table). This gives theoretical evidence of how RNNs can memorize data. More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze the first-order approximation of multi-layer networks. |
| Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (Microsoft Research AI, zeyuan@csail.mit.edu); Yuanzhi Li (Carnegie Mellon University, yuanzhil@andrew.cmu.edu); Zhao Song (UT-Austin, zhaos@utexas.edu) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "Full version and future updates can be found on https://arxiv.org/abs/1810.12065." This link refers to the arXiv paper itself, not to any open-source code for the described methodology. No other statements about code availability are present. |
| Open Datasets | No | The paper is theoretical and does not report empirical experiments on a specific public dataset. It refers to "training inputs" and "training sequences" only as part of the theoretical model setup, and provides no concrete access information (link, DOI, citation) for any publicly available dataset. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments involving dataset splits. While it mentions "training data", there is no discussion of training/validation/test splits for empirical evaluation. |
| Hardware Specification | No | The paper is theoretical and does not describe any hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not provide specific software dependency details with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe a concrete experimental setup, hyperparameters, or training configurations for empirical reproduction. It defines theoretical parameters for the mathematical analysis, but these do not constitute an empirical setup. |
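
For intuition, the "linear convergence rate" claim quoted in the Research Type row can be written schematically. The sketch below is a hedged illustration of the shape of such a guarantee, not the paper's exact theorem statement; the constants and the precise polynomial dependence on the number of training sequences $n$ and the sequence length $L$ are placeholders.

```latex
% Schematic linear-convergence guarantee for SGD on the RNN
% regression loss f(W); the poly(n, L) factors are illustrative
% placeholders, not the paper's exact bounds.
\[
  f(W_{t+1}) \;\le\; \Bigl(1 - \Omega\bigl(\tfrac{1}{\mathrm{poly}(n, L)}\bigr)\Bigr)\, f(W_t),
\]
% so the loss contracts geometrically, reaching any target accuracy
% epsilon after a number of iterations logarithmic in 1/epsilon:
\[
  f(W_T) \;\le\; \varepsilon
  \quad\text{once}\quad
  T \;=\; O\bigl(\mathrm{poly}(n, L)\,\log(1/\varepsilon)\bigr),
\]
% provided the network width m is sufficiently large,
% i.e. m >= poly(n, L), as assumed in the quoted abstract.
```

Geometric per-step contraction of this form is what "linear convergence" means in the optimization literature: the iteration count scales logarithmically, not polynomially, in the target accuracy.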