AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks

Authors: Bo Chang, Minmin Chen, Eldad Haber, Ed H. Chi

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive simulations and experiments to demonstrate the benefits of this new RNN architecture. AntisymmetricRNN exhibits well-behaved dynamics and outperforms the regular LSTM model on tasks requiring long-term memory, and matches its performance on tasks where short-term dependencies dominate with much fewer parameters. 5 EXPERIMENTS: The performance of the proposed antisymmetric networks is evaluated on four image classification tasks with long-range dependencies.
Researcher Affiliation | Collaboration | Bo Chang (University of British Columbia, Vancouver, BC, Canada; bchang@stat.ubc.ca); Minmin Chen (Google Brain, Mountain View, CA, USA; minminc@google.com); Eldad Haber (University of British Columbia, Vancouver, BC, Canada; haber@math.ubc.ca); Ed H. Chi (Google Brain, Mountain View, CA, USA; edchi@google.com)
Pseudocode | No
Open Source Code | No
Open Datasets | Yes | 5.1 PIXEL-BY-PIXEL MNIST: In the first task, we learn to classify the MNIST digits by pixels (LeCun et al., 1998). 5.2 PIXEL-BY-PIXEL CIFAR-10: The CIFAR-10 dataset contains 32 × 32 colour images in 10 classes (Krizhevsky & Hinton, 2009).
Dataset Splits | Yes | We use the standard train/test split of MNIST and CIFAR-10.
Hardware Specification | No
Software Dependencies | No
Experiment Setup | Yes | C EXPERIMENTAL DETAILS: Let m be the input dimension and n the number of hidden units. The input-to-hidden matrices are initialized to N(0, 1/m). The hidden-to-hidden matrices are initialized to N(0, σ_w²/n), where σ_w is chosen from {0, 1, 2, 4, 8, 16}. The bias terms are initialized to zero, except the forget gate bias of the LSTM, which is initialized to 1, as suggested by Jozefowicz et al. (2015). For AntisymmetricRNNs, the step size ϵ ∈ {0.01, 0.1, 1} and the diffusion γ ∈ {0.001, 0.01, 0.1, 1.0}. We use SGD with momentum and Adagrad (Duchi et al., 2011) as optimizers, with a batch size of 128 and a learning rate chosen from {0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1}. On MNIST and pixel-by-pixel CIFAR-10, all models are trained for 50,000 iterations; on noise-padded CIFAR-10, models are trained for 10,000 iterations.
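
To make the pixel-by-pixel tasks listed under Open Datasets concrete, the sketch below shows one plausible way to build pixel sequences from the standard Keras train/test splits of MNIST and CIFAR-10. The paper releases no code, so the normalization, pixel ordering, and per-step input dimension shown here are assumptions rather than the authors' pipeline.

```python
# Sketch (not the authors' code): pixel-by-pixel sequences from the
# standard Keras train/test splits of MNIST and CIFAR-10.
import numpy as np
import tensorflow as tf


def pixel_by_pixel_mnist():
    # Each 28x28 grayscale image becomes a length-784 sequence with a
    # single scalar input per time step.
    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
    x_tr = x_tr.reshape(-1, 28 * 28, 1).astype(np.float32) / 255.0
    x_te = x_te.reshape(-1, 28 * 28, 1).astype(np.float32) / 255.0
    return (x_tr, y_tr), (x_te, y_te)


def pixel_by_pixel_cifar10():
    # Each 32x32x3 colour image becomes a length-1024 sequence whose
    # 3-dimensional input at each step is one pixel's RGB value
    # (one reading of "pixel-by-pixel"; the exact ordering is assumed).
    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.cifar10.load_data()
    x_tr = x_tr.reshape(-1, 32 * 32, 3).astype(np.float32) / 255.0
    x_te = x_te.reshape(-1, 32 * 32, 3).astype(np.float32) / 255.0
    return (x_tr, y_tr), (x_te, y_te)
```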
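
As a companion to the Experiment Setup row, here is a minimal NumPy sketch of an AntisymmetricRNN forward pass using the initialization quoted from Appendix C: input-to-hidden weights drawn from N(0, 1/m), hidden-to-hidden weights from N(0, σ_w²/n), zero biases, step size ϵ, and diffusion γ. The update rule h_t = h_{t-1} + ϵ·tanh((W − Wᵀ − γI)h_{t-1} + V x_t + b) follows the paper's antisymmetric formulation; the function names and the particular hyperparameter values picked below are illustrative only.

```python
# Minimal NumPy sketch (not the authors' code) of an AntisymmetricRNN
# forward pass with the initialization described in Appendix C.
import numpy as np


def init_params(m, n, sigma_w=1.0, seed=0):
    """m: input dimension, n: number of hidden units."""
    rng = np.random.default_rng(seed)
    V = rng.normal(0.0, np.sqrt(1.0 / m), size=(n, m))         # input-to-hidden ~ N(0, 1/m)
    W = rng.normal(0.0, np.sqrt(sigma_w**2 / n), size=(n, n))  # hidden-to-hidden ~ N(0, sigma_w^2/n)
    b = np.zeros(n)                                            # biases initialized to zero
    return V, W, b


def antisymmetric_rnn_forward(x_seq, V, W, b, eps=0.1, gamma=0.01):
    """x_seq: array of shape (T, m); returns the final hidden state of shape (n,)."""
    n = W.shape[0]
    A = (W - W.T) - gamma * np.eye(n)   # antisymmetric weight matrix minus diffusion
    h = np.zeros(n)
    for x_t in x_seq:                   # explicit Euler step with step size eps
        h = h + eps * np.tanh(A @ h + V @ x_t + b)
    return h


# Example: one length-784 scalar-input sequence, as in pixel-by-pixel MNIST.
V, W, b = init_params(m=1, n=128, sigma_w=1.0)
h_final = antisymmetric_rnn_forward(np.random.rand(784, 1), V, W, b, eps=0.1, gamma=0.01)
print(h_final.shape)  # -> (128,)
```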