Quasi-Recurrent Neural Networks

Authors: James Bradbury, Stephen Merity, Caiming Xiong, Richard Socher

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of the QRNN on three different natural language tasks: document-level sentiment classification, language modeling, and character-based neural machine translation. Our QRNN models outperform LSTM-based models of equal hidden size on all three tasks while dramatically improving computation speed. Experiments were implemented in Chainer (Tokui et al.).
Researcher Affiliation | Industry | James Bradbury, Stephen Merity, Caiming Xiong & Richard Socher, Salesforce Research, Palo Alto, California, {james.bradbury,smerity,cxiong,rsocher}@salesforce.com
Pseudocode | No | The paper provides mathematical equations and block diagrams (Figure 1), but no structured pseudocode or algorithm blocks are present (a sketch of the fo-pooling recurrence follows the table).
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We evaluate the QRNN architecture on a popular document-level sentiment classification benchmark, the IMDb movie review dataset (Maas et al., 2011). We replicate the language modeling experiment of Zaremba et al. (2014) and Gal & Ghahramani (2016) to benchmark the QRNN architecture for natural language sequence prediction. The experiment uses a standard preprocessed version of the Penn Treebank (PTB) by Mikolov et al. (2010). We evaluate the sequence-to-sequence QRNN architecture described in Section 2.1 on a challenging neural machine translation task, IWSLT German-English spoken-domain translation...
Dataset Splits | Yes | The dataset consists of a balanced sample of 25,000 positive and 25,000 negative reviews, divided into equal-size train and test sets, with an average document length of 231 words (Wang & Manning, 2012). Our best performance on a held-out development set was achieved using a four-layer densely connected QRNN with 256 units per layer... (a loading sketch follows the table).
Hardware Specification | Yes | When training on the PTB dataset with an NVIDIA K40 GPU, we found that the QRNN is substantially faster than a standard LSTM, even when comparing against the optimized cuDNN LSTM.
Software Dependencies | No | Experiments were implemented in Chainer (Tokui et al.). We observed a speedup of 3.2x on IMDb train time per epoch compared to the optimized LSTM implementation provided in NVIDIA's cuDNN library. No version numbers are provided for Chainer or cuDNN.
Experiment Setup | Yes | Dropout of 0.3 was applied between layers, and we used L2 regularization of 4 × 10⁻⁶. Optimization was performed on minibatches of 24 examples using RMSprop (Tieleman & Hinton, 2012) with learning rate of 0.001, α = 0.9, and ϵ = 10⁻⁸. The learning rate was set at 1 for six epochs, then decayed by 0.95 for each subsequent epoch, for a total of 72 epochs. We additionally used L2 regularization of 2 × 10⁻⁴ and rescaled gradients with norm above 10. Zoneout was applied by performing dropout with ratio 0.1 on the forget gates of the QRNN, without rescaling the output of the dropout function. Batches consist of 20 examples, each 105 timesteps. Optimization was performed for 10 epochs on minibatches of 16 examples using Adam (Kingma & Ba, 2014) with α = 0.001, β₁ = 0.9, β₂ = 0.999, and ϵ = 10⁻⁸. (A configuration sketch follows the table.)
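
The Pseudocode row above notes that the paper expresses its method only as equations and block diagrams. For readers who want a concrete rendering, the fo-pooling recurrence at the core of the QRNN can be sketched as below. This is an illustrative PyTorch sketch written for this summary, not code from the authors; the tensors z, f, and o stand for the tanh- and sigmoid-activated outputs of the paper's masked convolutions.

    import torch

    def fo_pool(z, f, o, c0=None):
        """fo-pooling: the element-wise recurrence the QRNN uses in place of an LSTM cell.

        z, f, o: tensors of shape (seq_len, batch, hidden), produced in parallel
        across timesteps by the masked convolutions (z via tanh, f and o via sigmoid).
        """
        c = torch.zeros_like(z[0]) if c0 is None else c0
        outputs = []
        for t in range(z.shape[0]):
            # c_t = f_t * c_{t-1} + (1 - f_t) * z_t   (forget-gated cell update)
            c = f[t] * c + (1.0 - f[t]) * z[t]
            # h_t = o_t * c_t                         (output gate)
            outputs.append(o[t] * c)
        return torch.stack(outputs), c

    # Shapes echo the language-modeling batches quoted above: 105 timesteps, 20 examples.
    z = torch.tanh(torch.randn(105, 20, 256))
    f = torch.sigmoid(torch.randn(105, 20, 256))
    o = torch.sigmoid(torch.randn(105, 20, 256))
    h, c_last = fo_pool(z, f, o)

Only this element-wise loop is sequential; the convolutions that produce z, f, and o run across all timesteps at once, which is where the reported speed advantage over an equally sized LSTM comes from.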
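
The Open Datasets and Dataset Splits rows describe the balanced 25,000/25,000 train/test division of the IMDb corpus. A minimal loader for the standard Maas et al. (2011) distribution might look like the following; the aclImdb/{train,test}/{pos,neg} directory layout is the dataset's usual packaging, not something stated on this page.

    from pathlib import Path

    def load_imdb_split(root, split="train"):
        """Read one IMDb split from the standard aclImdb directory layout.

        Each of the train and test splits holds 12,500 positive and 12,500
        negative reviews, giving the balanced 25k/25k division quoted above.
        """
        examples = []
        for label in ("pos", "neg"):
            for path in sorted(Path(root, split, label).glob("*.txt")):
                examples.append((path.read_text(encoding="utf-8"),
                                 1 if label == "pos" else 0))
        return examples

    # train = load_imdb_split("aclImdb", "train")  # expected length: 25,000
    # test  = load_imdb_split("aclImdb", "test")   # expected length: 25,000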
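
The Experiment Setup row mixes hyperparameters from the three experiments. One hedged reading of those settings in PyTorch is sketched below; `model` is a stand-in for the QRNN networks, mapping the quoted L2 coefficients to weight decay is an assumption, and the optimizer for the Penn Treebank schedule is assumed to be plain SGD since the row does not name it.

    import torch

    model = torch.nn.Linear(256, 2)  # placeholder; the QRNN models themselves are not shown

    # IMDb sentiment setup: RMSprop with lr = 0.001, alpha = 0.9, eps = 1e-8,
    # and L2 regularization of 4e-6 (assumed here to be applied as weight decay).
    rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9,
                                  eps=1e-8, weight_decay=4e-6)

    # Penn Treebank schedule: learning rate 1 for six epochs, then multiplied by
    # 0.95 each subsequent epoch, with L2 of 2e-4 (optimizer assumed to be SGD).
    sgd = torch.optim.SGD(model.parameters(), lr=1.0, weight_decay=2e-4)
    schedule = torch.optim.lr_scheduler.LambdaLR(
        sgd, lr_lambda=lambda epoch: 1.0 if epoch < 6 else 0.95 ** (epoch - 5))

    # Gradients with norm above 10 are rescaled before each step, e.g.:
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)

    # IWSLT translation setup: Adam with alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8.
    adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

    def zoneout_forget_gate(f, p=0.1, training=True):
        """Dropout with ratio p on the forget gate, without rescaling.

        Dropped positions force f to 1, so the previous cell state is carried
        through unchanged, matching the zoneout description quoted above.
        """
        if not training or p == 0.0:
            return f
        keep = torch.bernoulli(torch.full_like(f, 1.0 - p))
        return 1.0 - (1.0 - f) * keep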