AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks

Authors: Bo Chang, Minmin Chen, Eldad Haber, Ed H. Chi

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive simulations and experiments to demonstrate the benefits of this new RNN architecture. AntisymmetricRNN exhibits well-behaved dynamics and outperforms the regular LSTM model on tasks requiring long-term memory, and matches its performance on tasks where short-term dependencies dominate with much fewer parameters. 5 EXPERIMENTS: The performance of the proposed antisymmetric networks is evaluated on four image classification tasks with long-range dependencies.
Researcher Affiliation | Collaboration | Bo Chang (University of British Columbia, Vancouver, BC, Canada; bchang@stat.ubc.ca); Minmin Chen (Google Brain, Mountain View, CA, USA; minminc@google.com); Eldad Haber (University of British Columbia, Vancouver, BC, Canada; haber@math.ubc.ca); Ed H. Chi (Google Brain, Mountain View, CA, USA; edchi@google.com)
Pseudocode | No
Open Source Code | No
Open Datasets | Yes | 5.1 PIXEL-BY-PIXEL MNIST: In the first task, we learn to classify the MNIST digits by pixels (LeCun et al., 1998). 5.2 PIXEL-BY-PIXEL CIFAR-10: The CIFAR-10 dataset contains 32 × 32 colour images in 10 classes (Krizhevsky & Hinton, 2009).
Dataset Splits | Yes | We use the standard train/test split of MNIST and CIFAR-10.
Hardware Specification | No
Software Dependencies | No
Experiment Setup | Yes | C EXPERIMENTAL DETAILS: Let m be the input dimension and n the number of hidden units. The input-to-hidden matrices are initialized to N(0, 1/m). The hidden-to-hidden matrices are initialized to N(0, σ_w²/n), where σ_w is chosen from {0, 1, 2, 4, 8, 16}. The bias terms are initialized to zero, except the forget gate bias of the LSTM, which is initialized to 1, as suggested by Jozefowicz et al. (2015). For AntisymmetricRNNs, the step size ϵ ∈ {0.01, 0.1, 1} and the diffusion γ ∈ {0.001, 0.01, 0.1, 1.0}. We use SGD with momentum and Adagrad (Duchi et al., 2011) as optimizers, with a batch size of 128 and a learning rate chosen from {0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1}. On MNIST and pixel-by-pixel CIFAR-10, all models are trained for 50,000 iterations; on noise-padded CIFAR-10, models are trained for 10,000 iterations.
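
To make the pixel-by-pixel tasks listed under Open Datasets concrete, the sketch below shows one plausible way to build pixel sequences from the standard Keras train/test splits of MNIST and CIFAR-10. The paper releases no code, so the normalization, pixel ordering, and per-step input dimension shown here are assumptions rather than the authors' pipeline.

```python
# Sketch (not the authors' code): pixel-by-pixel sequences from the
# standard Keras train/test splits of MNIST and CIFAR-10.
import numpy as np
import tensorflow as tf


def pixel_by_pixel_mnist():
    # Each 28x28 grayscale image becomes a length-784 sequence with a
    # single scalar input per time step.
    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
    x_tr = x_tr.reshape(-1, 28 * 28, 1).astype(np.float32) / 255.0
    x_te = x_te.reshape(-1, 28 * 28, 1).astype(np.float32) / 255.0
    return (x_tr, y_tr), (x_te, y_te)


def pixel_by_pixel_cifar10():
    # Each 32x32x3 colour image becomes a length-1024 sequence whose
    # 3-dimensional input at each step is one pixel's RGB value
    # (one reading of "pixel-by-pixel"; the exact ordering is assumed).
    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.cifar10.load_data()
    x_tr = x_tr.reshape(-1, 32 * 32, 3).astype(np.float32) / 255.0
    x_te = x_te.reshape(-1, 32 * 32, 3).astype(np.float32) / 255.0
    return (x_tr, y_tr), (x_te, y_te)
```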
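
As a companion to the Experiment Setup row, here is a minimal NumPy sketch of an AntisymmetricRNN forward pass using the initialization quoted from Appendix C: input-to-hidden weights drawn from N(0, 1/m), hidden-to-hidden weights from N(0, σ_w²/n), zero biases, step size ϵ, and diffusion γ. The update rule h_t = h_{t-1} + ϵ·tanh((W − Wᵀ − γI)h_{t-1} + V x_t + b) follows the paper's antisymmetric formulation; the function names and the particular hyperparameter values picked below are illustrative only.

```python
# Minimal NumPy sketch (not the authors' code) of an AntisymmetricRNN
# forward pass with the initialization described in Appendix C.
import numpy as np


def init_params(m, n, sigma_w=1.0, seed=0):
    """m: input dimension, n: number of hidden units."""
    rng = np.random.default_rng(seed)
    V = rng.normal(0.0, np.sqrt(1.0 / m), size=(n, m))         # input-to-hidden ~ N(0, 1/m)
    W = rng.normal(0.0, np.sqrt(sigma_w**2 / n), size=(n, n))  # hidden-to-hidden ~ N(0, sigma_w^2/n)
    b = np.zeros(n)                                            # biases initialized to zero
    return V, W, b


def antisymmetric_rnn_forward(x_seq, V, W, b, eps=0.1, gamma=0.01):
    """x_seq: array of shape (T, m); returns the final hidden state of shape (n,)."""
    n = W.shape[0]
    A = (W - W.T) - gamma * np.eye(n)   # antisymmetric weight matrix minus diffusion
    h = np.zeros(n)
    for x_t in x_seq:                   # explicit Euler step with step size eps
        h = h + eps * np.tanh(A @ h + V @ x_t + b)
    return h


# Example: one length-784 scalar-input sequence, as in pixel-by-pixel MNIST.
V, W, b = init_params(m=1, n=128, sigma_w=1.0)
h_final = antisymmetric_rnn_forward(np.random.rand(784, 1), V, W, b, eps=0.1, gamma=0.01)
print(h_final.shape)  # -> (128,)
```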