On the Convergence Rate of Training Recurrent Neural Networks

Authors: Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in L, SGD is capable of minimizing the regression loss at a linear convergence rate. This gives theoretical evidence of how RNNs can memorize data. More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze first-order approximations of multi-layer networks. (A toy illustration of this setting appears after the table.)
Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (Microsoft Research AI, zeyuan@csail.mit.edu); Yuanzhi Li (Carnegie Mellon University, yuanzhil@andrew.cmu.edu); Zhao Song (UT-Austin, zhaos@utexas.edu)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "Full version and future updates can be found on https://arxiv.org/abs/1810.12065." This link refers to the arXiv paper itself, not to any open-source code for the described methodology. No other statements about code availability are present.
Open Datasets | No | The paper is theoretical and does not report empirical experiments on a specific public dataset. It refers to "training inputs" and "training sequences" as part of the theoretical model setup, but provides no concrete access information (link, DOI, or citation) for a publicly available dataset.
Dataset Splits | No | The paper is theoretical and does not conduct experiments involving dataset splits. While it mentions "training data", there is no discussion of training/validation/test splits for empirical evaluation.
Hardware Specification | No | The paper is theoretical and does not describe any hardware used to run experiments.
Software Dependencies | No | The paper is theoretical and does not provide software dependency details with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe a concrete experimental setup, hyperparameters, or training configuration for empirical reproduction. It defines parameters for the mathematical analysis, but these do not constitute an empirical setup.
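The "Research Type" row above summarizes the paper's main claim: for a ReLU recurrent network whose width is polynomially large in the number of training sequences and in L, gradient-based training drives the regression loss down at a linear (geometric) rate. As a rough illustration only, the PyTorch sketch below instantiates that setting on toy data. The paper itself ships no code, so this is not the authors' implementation; every dimension, the 1/sqrt(m) initialization scale, and the step size are illustrative assumptions rather than the quantities the theorem prescribes, and full-batch gradient descent stands in for the SGD analyzed in the paper.

```python
# A minimal sketch (not the authors' code): an Elman-style RNN with ReLU
# activations, random Gaussian initialization, and gradient descent on an
# L2 regression loss -- the setting the paper's theory studies. All sizes
# below are toy choices, not the polynomial-in-(n, L) width the theorem
# actually requires.
import torch

torch.manual_seed(0)
n, L, d_in, d_out, m = 8, 5, 4, 2, 512   # samples, sequence length, dims; m = hidden width

X = torch.randn(n, L, d_in)
X = X / X.norm(dim=-1, keepdim=True)     # unit-norm inputs, common in such analyses
Y = 0.1 * torch.randn(n, L, d_out)       # toy regression targets

W = (torch.randn(m, m) / m ** 0.5).requires_grad_()      # recurrent weights
A = (torch.randn(m, d_in) / m ** 0.5).requires_grad_()   # input weights
B = torch.randn(d_out, m) / m ** 0.5                     # output layer, held fixed

def regression_loss():
    h = torch.zeros(n, m)
    total = torch.zeros(())
    for t in range(L):
        h = torch.relu(h @ W.T + X[:, t] @ A.T)          # ReLU recurrence
        total = total + 0.5 * ((h @ B.T - Y[:, t]) ** 2).sum()
    return total

opt = torch.optim.SGD([W, A], lr=0.1)    # step size is an illustrative choice
for step in range(501):
    opt.zero_grad()
    loss = regression_loss()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        # With sufficient width, the theory predicts roughly geometric decay.
        print(f"step {step:4d}  loss {loss.item():.6f}")
```

At the polynomially large widths the theorem requires, the printed losses would decay geometrically; at a toy width such as m = 512 the decrease is only expected to roughly follow that shape.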