Capacity and Trainability in Recurrent Neural Networks

Authors: Jasmine Collins, Jascha Sohl-Dickstein, David Sussillo

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that all common RNN architectures achieve nearly the same per-task and per-unit capacity bounds with careful training, for a variety of tasks and stacking depths.
Researcher Affiliation | Industry | Jasmine Collins, Jascha Sohl-Dickstein & David Sussillo, Google Brain, Google Inc., Mountain View, CA 94043, USA. {jlcollins, jaschasd, sussillo}@google.com
Pseudocode | No | The paper provides mathematical equations for the RNN architectures (RNN, UGRNN, GRU, LSTM, +RNN) but does not include structured pseudocode or algorithm blocks for the overall methodology or experimental process (a vanilla RNN/UGRNN update sketch follows the table).
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | text8: 1-step-ahead character-based prediction on the text8 Wikipedia dataset (100 million characters) (Mahoney, 2011) (a data-preparation sketch follows the table).
Dataset Splits | Yes | For all our tasks, we requested HPs from the tuner, and reported loss on a validation dataset. For the per-parameter capacity task, the evaluation, validation and training datasets were identical.
Hardware Specification | No | The paper mentions "CPU-millennia worth of computation" but does not provide specific details about the CPU models, GPU models, or any other hardware specifications used for the experiments.
Software Dependencies | No | The paper mentions using an "HP tuner that uses a Gaussian Process model similar to Spearmint" and various optimization algorithms such as vanilla SGD, SGD with momentum, RMSProp (Tieleman & Hinton, 2012), and ADAM (Kingma & Ba, 2014), but it does not specify version numbers for any of these software components.
Experiment Setup | Yes | For all our tasks, we requested HPs from the tuner, and reported loss on a validation dataset. For the per-parameter capacity task, the evaluation, validation and training datasets were identical. For text8, the validation and evaluation sets consisted of different sections of held-out data. For all other tasks, the evaluation, validation, and training sets were randomly drawn from the same distribution. The performance we plot in all cases is on the evaluation dataset. Below is the list of all tunable HPs that were generically applied to all models. In total, each RNN variant had between 10 and 27 HP dimensions relating to the architecture, optimization, and regularization. s(), as used in the RNN definitions, is a nonlinearity determined by the HP tuner, chosen from {ReLU, tanh}; the only exception was the IRNN, which used ReLU exclusively. ... The number of training steps: the exact range varied between tasks, but always fell between 50K and 20M. One of four optimization algorithms could be chosen: vanilla SGD, SGD with momentum, RMSProp (Tieleman & Hinton, 2012), or ADAM (Kingma & Ba, 2014). Learning rate initial value: exponentially distributed in [1e-4, 1e-1]. Learning rate decay: exponentially distributed in [1e-3, 1]. (A hedged HP-sampling sketch follows the table.)
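
Since the paper gives only update equations for its cells and no pseudocode, here is a minimal sketch of single-step updates for the vanilla RNN and the UGRNN. It is written from the standard published definitions of these cells, not from any released code, so the exact parameterization in the paper may differ.

```python
# Minimal sketch (not the paper's code): single-step updates for the vanilla RNN
# and UGRNN cells, written from their standard definitions with numpy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(x, h, W_x, W_h, b, s=np.tanh):
    """Vanilla RNN: h_t = s(W_x x_t + W_h h_{t-1} + b); s is ReLU or tanh per the HP tuner."""
    return s(W_x @ x + W_h @ h + b)

def ugrnn_step(x, h, W_cx, W_ch, b_c, W_gx, W_gh, b_g, s=np.tanh):
    """UGRNN: a single update gate g interpolates between the old state and a candidate c."""
    c = s(W_cx @ x + W_ch @ h + b_c)        # candidate state
    g = sigmoid(W_gx @ x + W_gh @ h + b_g)  # update gate
    return g * h + (1.0 - g) * c            # gated interpolation
```

The GRU, LSTM, and +RNN cells discussed in the paper follow the same pattern with additional gates and, for the LSTM, a separate memory cell.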
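For the text8 task, a small data-preparation sketch is shown below. The file path, sequence length, batch size, and random-crop sampling are illustrative assumptions; the paper only states that the validation and evaluation sets came from different held-out sections of the corpus.

```python
# Illustrative sketch only: 1-step-ahead character-prediction pairs from text8.
# The file path and the sampling scheme are assumptions, not taken from the paper.
import numpy as np

def load_text8(path="text8"):
    """Read the 100M-character corpus and map each character to an integer id."""
    with open(path, "rb") as f:
        raw = f.read()
    vocab = sorted(set(raw))                      # lowercase a-z plus space
    char_to_id = {c: i for i, c in enumerate(vocab)}
    return np.array([char_to_id[c] for c in raw], dtype=np.int32)

def one_step_ahead_batch(ids, seq_len=50, batch_size=32, rng=np.random.default_rng(0)):
    """Targets are the inputs shifted forward by one character."""
    starts = rng.integers(0, len(ids) - seq_len - 1, size=batch_size)
    inputs = np.stack([ids[s:s + seq_len] for s in starts])
    targets = np.stack([ids[s + 1:s + seq_len + 1] for s in starts])
    return inputs, targets
```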
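The quoted HP ranges can be made concrete with a small sampling sketch. The paper's tuner is a Gaussian Process model similar to Spearmint; plain log-uniform sampling stands in here purely to illustrate the quoted ranges and choices, and the distribution over the training-step budget is an assumption.

```python
# Illustrative random draw of the tunable HPs quoted above; not the paper's tuner.
import numpy as np

def log_uniform(low, high, rng):
    """Sample exponentially (log-uniformly) within [low, high]."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def sample_hparams(seed=0):
    rng = np.random.default_rng(seed)
    return {
        # One of the four optimizers named in the paper.
        "optimizer": str(rng.choice(["sgd", "momentum", "rmsprop", "adam"])),
        # Nonlinearity s() chosen by the tuner (the IRNN is fixed to ReLU).
        "nonlinearity": str(rng.choice(["relu", "tanh"])),
        "learning_rate": log_uniform(1e-4, 1e-1, rng),       # initial value
        "learning_rate_decay": log_uniform(1e-3, 1.0, rng),
        # 50K-20M steps is the quoted overall range; the distribution is an assumption.
        "num_train_steps": int(log_uniform(5e4, 2e7, rng)),
    }
```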