Recurrent Orthogonal Networks and Long-Memory Tasks

Authors: Mikael Henaff, Arthur Szlam, Yann LeCun

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we carefully analyze two synthetic datasets originally outlined in (Hochreiter & Schmidhuber, 1997) which are used to evaluate the ability of RNNs to store information over many time steps. ... We verify experimentally that initializing correctly (i.e. random orthogonal or identity) is critical for success on these tasks." (Section 4, Experiments; a sketch of this initialization appears after the table.)
Researcher Affiliation | Collaboration | Mikael Henaff (MBH305@NYU.EDU), New York University, Facebook AI Research; Arthur Szlam (ASZLAM@FACEBOOK.COM), Facebook AI Research; Yann LeCun (YANN@CS.NYU.EDU), New York University, Facebook AI Research
Pseudocode | No | The paper describes models and solutions using prose and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | "In this work, we carefully analyze two synthetic datasets originally outlined in (Hochreiter & Schmidhuber, 1997)"
Dataset Splits | No | The paper mentions training networks and evaluating performance but does not specify details regarding training/validation/test dataset splits (e.g., percentages, sample counts, or cross-validation setup).
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., "Python 3.8", "PyTorch 1.9") needed to replicate the experiment.
Experiment Setup | Yes | "In all experiments, we used RMSProp to train our networks with a fixed learning rate and a decay rate of 0.9. In preliminary experiments we tried different learning rates in {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5} and chose the largest one for which the loss did not diverge; for the LT-RNNs we used 10^-4. ... We also included LSTMs in all our experiments as a baseline. We used the same method as for LT-RNN to pick the learning rate, and ended up with 10^-3. ... For all experiments, we normalized the gradients with respect to hidden activations by 1/T, where T denotes the number of timesteps. ... we adopted a simple activation clipping strategy where we rescaled activations to have magnitude l whenever their magnitude exceeded l. In our experiments we chose l = 1000. ... All networks are trained with 80 hidden units. ... All networks are trained with 128 hidden units. ... We added a soft penalty on the transition matrix V to keep it orthogonal throughout training. Specifically, at every iteration we applied one step of stochastic gradient descent to minimize the loss ||V^T V - I||, evaluated at m random points on the unit sphere. ... In our experiments we set m = 50, which was the same as the minibatch size. ... In all pooling experiments we used a pool size and stride of 2." (A sketch of this training setup appears after the table.)
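
The Research Type row quotes the paper's claim that random orthogonal or identity initialization of the recurrent transition matrix is critical. Below is a minimal PyTorch sketch of those two initializations; it is an illustration, not the authors' code, and the hidden size of 80 is simply the value quoted in the Experiment Setup row.

```python
# Minimal sketch (not the authors' code) of the two recurrent-weight
# initializations the paper calls critical: random orthogonal and identity.
import torch


def random_orthogonal(n_hidden: int) -> torch.Tensor:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    a = torch.randn(n_hidden, n_hidden)
    q, r = torch.linalg.qr(a)
    # Sign correction so Q is drawn uniformly from the orthogonal group.
    return q * torch.sign(torch.diagonal(r))


def identity_init(n_hidden: int) -> torch.Tensor:
    """Identity initialization of the transition matrix."""
    return torch.eye(n_hidden)


V = random_orthogonal(80)  # 80 hidden units in one set of experiments, 128 in the other
assert torch.allclose(V @ V.T, torch.eye(80), atol=1e-5)
```

The built-in torch.nn.init.orthogonal_ performs the same QR-based construction if a library routine is preferred.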
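
The Experiment Setup row describes RMSProp with decay 0.9, an activation-clipping rule at l = 1000, and a stochastic soft orthogonality penalty on the transition matrix V. The sketch below shows one way those pieces could fit together. The exact form of the penalty (mean squared norm of (V^T V - I)u over m random unit vectors u) is an assumption on my part, since the quote only states that ||V^T V - I|| is minimized at m random points on the unit sphere; the 1/T gradient normalization on hidden activations is noted but not implemented.

```python
# Sketch of the quoted training setup: RMSProp with decay 0.9, activation
# clipping at l = 1000, and a one-step SGD orthogonality penalty on V.
# The stochastic form of the penalty is an assumption, not the paper's code.
import torch

n_hidden = 80  # 128 in the other set of experiments
V = torch.nn.Parameter(torch.nn.init.orthogonal_(torch.empty(n_hidden, n_hidden)))

# RMSProp with a fixed learning rate and decay rate 0.9 (10^-4 for LT-RNNs).
optimizer = torch.optim.RMSprop([V], lr=1e-4, alpha=0.9)


def clip_activations(h: torch.Tensor, l: float = 1000.0) -> torch.Tensor:
    """Rescale hidden activations to magnitude l whenever their magnitude exceeds l."""
    norm = h.norm(dim=-1, keepdim=True)
    return torch.where(norm > l, h * (l / norm), h)


def orthogonality_penalty_step(V: torch.Tensor, m: int = 50, lr: float = 1e-4) -> None:
    """One SGD step on ||V^T V - I|| estimated at m random points on the unit sphere."""
    u = torch.randn(m, V.shape[0])
    u = u / u.norm(dim=1, keepdim=True)               # m points on the unit sphere
    residual = u @ (V.T @ V - torch.eye(V.shape[0]))  # (V^T V - I) applied to each point
    loss = residual.pow(2).sum(dim=1).mean()
    (grad,) = torch.autograd.grad(loss, V)
    with torch.no_grad():
        V -= lr * grad


# Per iteration: an RMSProp step on the task loss (omitted here), then one
# penalty step; m = 50 matches the quoted minibatch size. The 1/T normalization
# of the gradients with respect to hidden activations over T timesteps would be
# applied during backpropagation and is not shown.
orthogonality_penalty_step(V)
```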