Recurrent Orthogonal Networks and Long-Memory Tasks

Authors: Mikael Henaff, Arthur Szlam, Yann LeCun

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we carefully analyze two synthetic datasets originally outlined in (Hochreiter & Schmidhuber, 1997) which are used to evaluate the ability of RNNs to store information over many time steps. ... We verify experimentally that initializing correctly (i.e. random orthogonal or identity) is critical for success on these tasks." (Section 4, Experiments; a sketch of this initialization appears after the table.)
Researcher Affiliation | Collaboration | Mikael Henaff (MBH305@NYU.EDU), New York University, Facebook AI Research; Arthur Szlam (ASZLAM@FACEBOOK.COM), Facebook AI Research; Yann LeCun (YANN@CS.NYU.EDU), New York University, Facebook AI Research
Pseudocode | No | The paper describes models and solutions using prose and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | "In this work, we carefully analyze two synthetic datasets originally outlined in (Hochreiter & Schmidhuber, 1997)"
Dataset Splits | No | The paper mentions training networks and evaluating performance but does not specify details regarding training/validation/test dataset splits (e.g., percentages, sample counts, or cross-validation setup).
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., "Python 3.8", "PyTorch 1.9") needed to replicate the experiment.
Experiment Setup | Yes | "In all experiments, we used RMSProp to train our networks with a fixed learning rate and a decay rate of 0.9. In preliminary experiments we tried different learning rates in {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5} and chose the largest one for which the loss did not diverge; for the LT-RNNs we used 10^-4. ... We also included LSTMs in all our experiments as a baseline. We used the same method as for LT-RNN to pick the learning rate, and ended up with 10^-3. ... For all experiments, we normalized the gradients with respect to hidden activations by 1/T, where T denotes the number of timesteps. ... we adopted a simple activation clipping strategy where we rescaled activations to have magnitude l whenever their magnitude exceeded l. In our experiments we chose l = 1000. ... All networks are trained with 80 hidden units. ... All networks are trained with 128 hidden units. ... We added a soft penalty on the transition matrix V to keep it orthogonal throughout training. Specifically, at every iteration we applied one step of stochastic gradient descent to minimize the loss ||V^T V - I||, evaluated at m random points on the unit sphere. ... In our experiments we set m = 50, which was the same as the minibatch size. ... In all pooling experiments we used a pool size and stride of 2." (A sketch of this training setup appears after the table.)
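
The Research Type row quotes the paper's claim that random orthogonal or identity initialization of the recurrent transition matrix is critical. Below is a minimal PyTorch sketch of those two initializations; it is an illustration, not the authors' code, and the hidden size of 80 is simply the value quoted in the Experiment Setup row.

```python
# Minimal sketch (not the authors' code) of the two recurrent-weight
# initializations the paper calls critical: random orthogonal and identity.
import torch


def random_orthogonal(n_hidden: int) -> torch.Tensor:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    a = torch.randn(n_hidden, n_hidden)
    q, r = torch.linalg.qr(a)
    # Sign correction so Q is drawn uniformly from the orthogonal group.
    return q * torch.sign(torch.diagonal(r))


def identity_init(n_hidden: int) -> torch.Tensor:
    """Identity initialization of the transition matrix."""
    return torch.eye(n_hidden)


V = random_orthogonal(80)  # 80 hidden units in one set of experiments, 128 in the other
assert torch.allclose(V @ V.T, torch.eye(80), atol=1e-5)
```

The built-in torch.nn.init.orthogonal_ performs the same QR-based construction if a library routine is preferred.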
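
The Experiment Setup row describes RMSProp with decay 0.9, an activation-clipping rule at l = 1000, and a stochastic soft orthogonality penalty on the transition matrix V. The sketch below shows one way those pieces could fit together. The exact form of the penalty (mean squared norm of (V^T V - I)u over m random unit vectors u) is an assumption on my part, since the quote only states that ||V^T V - I|| is minimized at m random points on the unit sphere; the 1/T gradient normalization on hidden activations is noted but not implemented.

```python
# Sketch of the quoted training setup: RMSProp with decay 0.9, activation
# clipping at l = 1000, and a one-step SGD orthogonality penalty on V.
# The stochastic form of the penalty is an assumption, not the paper's code.
import torch

n_hidden = 80  # 128 in the other set of experiments
V = torch.nn.Parameter(torch.nn.init.orthogonal_(torch.empty(n_hidden, n_hidden)))

# RMSProp with a fixed learning rate and decay rate 0.9 (10^-4 for LT-RNNs).
optimizer = torch.optim.RMSprop([V], lr=1e-4, alpha=0.9)


def clip_activations(h: torch.Tensor, l: float = 1000.0) -> torch.Tensor:
    """Rescale hidden activations to magnitude l whenever their magnitude exceeds l."""
    norm = h.norm(dim=-1, keepdim=True)
    return torch.where(norm > l, h * (l / norm), h)


def orthogonality_penalty_step(V: torch.Tensor, m: int = 50, lr: float = 1e-4) -> None:
    """One SGD step on ||V^T V - I|| estimated at m random points on the unit sphere."""
    u = torch.randn(m, V.shape[0])
    u = u / u.norm(dim=1, keepdim=True)               # m points on the unit sphere
    residual = u @ (V.T @ V - torch.eye(V.shape[0]))  # (V^T V - I) applied to each point
    loss = residual.pow(2).sum(dim=1).mean()
    (grad,) = torch.autograd.grad(loss, V)
    with torch.no_grad():
        V -= lr * grad


# Per iteration: an RMSProp step on the task loss (omitted here), then one
# penalty step; m = 50 matches the quoted minibatch size. The 1/T normalization
# of the gradients with respect to hidden activations over T timesteps would be
# applied during backpropagation and is not shown.
orthogonality_penalty_step(V)
```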