On the Provable Generalization of Recurrent Neural Networks

Authors: Lifu Wang, Bo Shen, Bo Hu, Xing Cao

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we analyze the training and generalization of RNNs with random initialization, and provide the following improvements over recent works: (1) For an RNN with input sequence x = (X_1, X_2, ..., X_L), previous works study learning functions that are summations of f(β_l^T X_l) and require normalization conditions ‖X_l‖ ≤ ε with some very small ε depending on the complexity of f. In this paper, using a detailed analysis of the neural tangent kernel matrix, we prove a generalization error bound for learning such functions without normalization conditions, and show that some notable concept classes are learnable with the numbers of iterations and samples scaling almost-polynomially in the input length L. (2) Moreover, we prove a novel result on learning N-variable functions of the input sequence of the form f(β^T [X_{l_1}, ..., X_{l_N}]), which do not belong to the additive concept class, i.e., the summation of functions f(X_l). We show that when either N or l_0 = max(l_1, ..., l_N) − min(l_1, ..., l_N) is small, f(β^T [X_{l_1}, ..., X_{l_N}]) is learnable with the number of iterations and samples scaling almost-polynomially in the input length L. (Both concept classes are sketched in code after the table.)
Researcher Affiliation | Academia | Lifu Wang (Beijing Jiaotong University, Lifu_Wang@bjtu.edu.cn); Bo Shen (Beijing Jiaotong University, bshen@bjtu.edu.cn); Bo Hu (Beijing Jiaotong University, hubo2018@bjtu.edu.cn); Xing Cao (Beijing Jiaotong University, caoxing@bjtu.edu.cn)
Pseudocode | Yes | Algorithm 1: Training RNN with SGD. Input: data set D, learning rate η. The entries of W_0 and A are i.i.d. generated from N(0, 2/m); the entries of B are i.i.d. generated from N(0, 1/m). For t = 1, 2, 3, ..., n: randomly sample (x_t, y_t) from the data set D and update W_t = W_{t-1} − η ∇_{W_{t-1}} ℓ(y_t · f(W_{t-1}, x_t)). (A runnable sketch follows the table.)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | Assume there is an unknown data set D = {x, y}. The inputs have the form x = (X_1, X_2, ..., X_L) ∈ (ℝ^d)^L, with ‖X_l‖ ≤ O(1) for all 1 ≤ l ≤ L. For every input x_i there is a label y_i with |y_i| ≤ 1. (A hypothetical generator satisfying these assumptions follows the table.)
Dataset Splits | No | The paper does not provide specific dataset split information for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | No | The paper does not contain specific experimental setup details such as concrete hyperparameter values or training configurations in the main text. It mentions a learning rate and an initialization scheme within the theoretical analysis, but not as part of a reproducible experiment.
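
The two concept classes quoted in the Research Type row are easy to state in code. The sketch below is illustrative only: the smooth choice f = tanh, the shapes of the β vectors, and all function names are our assumptions, since the paper treats a general class of sufficiently smooth f.

```python
import numpy as np

def additive_target(x, betas, f=np.tanh):
    """Additive concept class: y = sum_l f(beta_l^T X_l).

    x     : array of shape (L, d), the input sequence (X_1, ..., X_L).
    betas : array of shape (L, d), one weight vector beta_l per position.
    f     : scalar function applied to each inner product (tanh is an
            illustrative choice; the paper only assumes f is smooth).
    """
    return sum(f(beta @ X) for beta, X in zip(betas, x))

def n_variable_target(x, beta, positions, f=np.tanh):
    """N-variable concept class: y = f(beta^T [X_{l_1}, ..., X_{l_N}]).

    positions : the N indices (l_1, ..., l_N); the paper's bound depends
                on N and on l_0 = max(positions) - min(positions).
    beta      : array of shape (N * d,), acting on the concatenated block.
    """
    block = np.concatenate([x[l] for l in positions])
    return f(beta @ block)
```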
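
The Pseudocode row transcribes Algorithm 1, which trains only the recurrent matrix W by online SGD from the stated random initialization. Below is a minimal NumPy sketch, assuming an Elman-style recurrence h_l = ReLU(W h_{l-1} + A X_l) with scalar readout f(W, x) = B^T h_L, and a logistic loss as one convex instance of the generic ℓ; these concrete choices are ours, not fixed by the table.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_rnn(d, m):
    """Initialization from Algorithm 1: W_0, A ~ N(0, 2/m), B ~ N(0, 1/m)."""
    W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
    A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))
    B = rng.normal(0.0, np.sqrt(1.0 / m), (m,))
    return W, A, B

def forward(W, A, B, x):
    """Elman RNN f(W, x) = B^T h_L with h_l = ReLU(W h_{l-1} + A X_l).

    Returns the scalar output plus the hidden states and pre-activations
    cached for backpropagation through time.
    """
    m = W.shape[0]
    h, hs, gs = np.zeros(m), [np.zeros(m)], []
    for X in x:                        # x has shape (L, d)
        g = W @ h + A @ X              # pre-activation at step l
        h = np.maximum(g, 0.0)         # ReLU
        hs.append(h)
        gs.append(g)
    return B @ h, hs, gs

def grad_W(W, A, B, x, y):
    """Gradient w.r.t. W of the logistic loss l(z) = log(1 + exp(-z))
    evaluated at the margin z = y * f(W, x), computed by BPTT."""
    out, hs, gs = forward(W, A, B, x)
    dl = -y / (1.0 + np.exp(y * out))  # d/d(out) of log(1 + exp(-y*out))
    delta = dl * B                     # gradient flowing into h_L
    G = np.zeros_like(W)
    for l in range(len(gs) - 1, -1, -1):
        delta = delta * (gs[l] > 0)    # through the ReLU at step l
        G += np.outer(delta, hs[l])    # contribution of W h_{l-1}
        delta = W.T @ delta            # back to h_{l-1}
    return G

def train_sgd(data, d, m=128, eta=0.1, n=1000):
    """Algorithm 1: sample one point per step; only W is updated."""
    W, A, B = init_rnn(d, m)
    for _ in range(n):
        x, y = data[rng.integers(len(data))]
        W = W - eta * grad_W(W, A, B, x, y)
    return W, A, B
```

Keeping A and B frozen at initialization mirrors the algorithm as transcribed, where only W receives gradient updates.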
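
No dataset accompanies the paper; the Open Datasets row only records distributional assumptions. A hypothetical generator consistent with them, reusing additive_target, rng, and train_sgd from the sketches above, could look like:

```python
def make_dataset(n_samples, L, d):
    """Hypothetical data meeting the stated assumptions: unknown D = {x, y},
    x = (X_1, ..., X_L) in (R^d)^L with ||X_l|| <= O(1) and |y| <= 1."""
    betas = rng.normal(size=(L, d)) / np.sqrt(d)
    data = []
    for _ in range(n_samples):
        x = rng.normal(size=(L, d))
        # Rescale any row with norm > 1 so that ||X_l|| <= 1 for all l.
        x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)
        y = np.clip(additive_target(x, betas), -1.0, 1.0)  # keep |y| <= 1
        data.append((x, float(y)))
    return data

# Usage: run Algorithm 1 on synthetic data from the additive concept class.
data = make_dataset(n_samples=200, L=8, d=16)
W, A, B = train_sgd(data, d=16)
```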