On the Provable Generalization of Recurrent Neural Networks
Authors: Lifu Wang, Bo Shen, Bo Hu, Xing Cao
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we analyze the training and generalization of RNNs with random initialization and provide the following improvements over recent works: (1) For an RNN with input sequence $x = (X_1, X_2, \ldots, X_L)$, previous works study learning functions that are summations of $f(\beta_l^{\top} X_l)$ and require normalization conditions $\|X_l\| \le \epsilon$ for some very small $\epsilon$ depending on the complexity of $f$. In this paper, using a detailed analysis of the neural tangent kernel matrix, we prove a generalization error bound for learning such functions without normalization conditions, and show that some notable concept classes are learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$. (2) Moreover, we prove a novel result on learning $N$-variable functions of the input sequence of the form $f(\beta^{\top} [X_{l_1}, \ldots, X_{l_N}])$, which do not belong to the additive concept class, i.e., the summation of functions $f(X_l)$. We show that when either $N$ or $l_0 = \max(l_1, \ldots, l_N) - \min(l_1, \ldots, l_N)$ is small, $f(\beta^{\top} [X_{l_1}, \ldots, X_{l_N}])$ is learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$. (Both target classes are restated in the formula block after the table.) |
| Researcher Affiliation | Academia | Lifu Wang (Beijing Jiaotong University, Lifu_Wang@bjtu.edu.cn); Bo Shen (Beijing Jiaotong University, bshen@bjtu.edu.cn); Bo Hu (Beijing Jiaotong University, hubo2018@bjtu.edu.cn); Xing Cao (Beijing Jiaotong University, caoxing@bjtu.edu.cn) |
| Pseudocode | Yes | Algorithm 1: Training RNN with SGD. Input: data set $D$, learning rate $\eta$. The entries of $W_0$ and $A$ are i.i.d. generated from $N(0, 2/m)$; the entries of $B$ are i.i.d. generated from $N(0, 1/m)$. For $t = 1, 2, 3, \ldots, n$: randomly sample $(x_t, y_t)$ from the data set $D$ and update $W_t = W_{t-1} - \eta \nabla_{W_{t-1}} \ell(y_t f(W_{t-1}, x_t))$. (A hedged runnable sketch appears after the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | Assume there is an unknown data set $D = \{x, y\}$. The inputs have the form $x = (X_1, X_2, \ldots, X_L) \in (\mathbb{R}^d)^L$, with $\|X_l\| \le O(1)$ for all $1 \le l \le L$. For every input $x_i$, there is a label $y_i$ with $|y_i| \le 1$. |
| Dataset Splits | No | The paper does not provide specific dataset split information for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | No | The paper does not contain specific experimental setup details such as concrete hyperparameter values or training configurations in the main text. It mentions learning rate and initialization within a theoretical context but not for a reproducible experiment. |
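
The two concept classes quoted in the abstract cell carry the paper's main claims, so here is a hedged LaTeX restatement of the target functions as reconstructed from the quoted text; the normalization of $\beta$ and $\beta_l$ is the paper's and is not specified here.

```latex
\[
  \text{additive class (improvement 1):}\quad
  F(x) = \sum_{l=1}^{L} f\!\left(\beta_l^{\top} X_l\right),
  \qquad
  \text{$N$-variable class (improvement 2):}\quad
  F(x) = f\!\left(\beta^{\top} [X_{l_1}, \ldots, X_{l_N}]\right),
\]
\[
  \text{with span}\quad l_0 = \max(l_1, \ldots, l_N) - \min(l_1, \ldots, l_N).
\]
```

Per the quoted abstract, the first class is learnable without normalization conditions, while the second is shown learnable with iterations and samples scaling almost-polynomially in $L$ only when $N$ or $l_0$ is small.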
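Since Algorithm 1 is the paper's only procedural content, the following is a minimal runnable sketch of it, assuming an Elman-style recurrence $h_l = \sigma(W h_{l-1} + A X_l)$ with ReLU activation, readout $f(W, x) = B^{\top} h_L$, and logistic loss $\ell(z) = \log(1 + e^{-z})$. These architectural details are filled in from the paper's setting and are not verbatim from the pseudocode, which specifies only the initialization and the SGD update on $W$.

```python
import torch
import torch.nn.functional as F

# Sketch of Algorithm 1: SGD on the recurrent weights W of an RNN with
# NTK-style random initialization. Only W is trained; A and B stay fixed
# at their random draws, matching the pseudocode's single update rule.

def init_params(m, d):
    # Entries of W_0 and A are i.i.d. N(0, 2/m); entries of B are i.i.d. N(0, 1/m).
    W = (torch.randn(m, m) * (2.0 / m) ** 0.5).requires_grad_(True)
    A = torch.randn(m, d) * (2.0 / m) ** 0.5
    B = torch.randn(m) * (1.0 / m) ** 0.5
    return W, A, B

def forward(W, A, B, x):
    # x is the input sequence (X_1, ..., X_L); Elman recurrence is an assumption.
    h = torch.zeros(W.shape[0])
    for X_l in x:
        h = torch.relu(W @ h + A @ X_l)
    return B @ h  # scalar output f(W, x)

def train(dataset, m, d, lr, n_steps):
    W, A, B = init_params(m, d)
    for _ in range(n_steps):
        x, y = dataset[torch.randint(len(dataset), (1,)).item()]  # sample (x_t, y_t) from D
        loss = F.softplus(-y * forward(W, A, B, x))  # l(y_t f(W_{t-1}, x_t)), assumed logistic
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad  # W_t = W_{t-1} - eta * gradient
            W.grad.zero_()
    return W, A, B

# Toy usage: one length-5 sequence of 4-dimensional inputs with label +1.
if __name__ == "__main__":
    d, L = 4, 5
    data = [([torch.randn(d) / d ** 0.5 for _ in range(L)], 1.0)]
    train(data, m=256, d=d, lr=0.1, n_steps=100)
```

The choice of logistic loss makes $\ell(y f)$ a function of the margin, consistent with the pseudocode's $\ell(y_t f(W_{t-1}, x_t))$ form; any other margin loss would slot into the same loop.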