Recurrently Controlled Recurrent Networks
Authors: Yi Tay, Anh Tuan Luu, Siu Cheung Hui
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on a myriad of tasks in the NLP domain such as sentiment analysis (SST, IMDb, Amazon reviews, etc.), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Across all 26 datasets, our results demonstrate that RCRN not only consistently outperforms BiLSTMs but also stacked BiLSTMs, suggesting that our controller architecture might be a suitable replacement for the widely adopted stacked architecture. |
| Researcher Affiliation | Academia | Yi Tay (1), Luu Anh Tuan (2), and Siu Cheung Hui (3); (1, 3) Nanyang Technological University; (2) Institute for Infocomm Research; ytay017@ntu.edu.sg, at.luu@i2r.a-star.edu.sg, asschui@ntu.edu.sg |
| Pseudocode | No | The paper describes the model architecture through mathematical equations (1-16) but does not include any block explicitly labeled 'Pseudocode' or 'Algorithm'. (A hedged sketch of the controller-gated recurrence, written for this report, appears after this table.) |
| Open Source Code | Yes | The source code of our model can be found at https://github.com/vanzytay/NIPS2018_RCRN. |
| Open Datasets | Yes | We conduct extensive experiments on a myriad of tasks in the NLP domain such as sentiment analysis (SST, IMDb, Amazon reviews, etc.), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). More concretely, we use 16 Amazon review datasets from [Liu et al., 2017], the well-established Stanford Sentiment Treebank (SST-5/SST-2) [Socher et al., 2013] and the IMDb Sentiment dataset [Maas et al., 2011]. ... We use the TREC question classification dataset [Voorhees et al., 1999]. ... We use two popular benchmark datasets, i.e., the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015], and SciTail (Science Entailment) [Khot et al., 2018] datasets. ... We use the popular WikiQA [Yang et al., 2015] and TrecQA [Wang et al., 2007] datasets. ... We use the recent NarrativeQA [Kočiský et al., 2017] dataset... |
| Dataset Splits | No | The paper mentions using standard benchmark datasets and tuning hyperparameters (e.g., 'learning rate is tuned amongst {0.001, 0.0003, 0.0004}'), which implies the use of validation sets, but it never explicitly states the percentages or sample counts of the training/validation/test splits for any of the datasets used. |
| Hardware Specification | Yes | We use the same standard hardware (a single Nvidia GTX1070 card) and an identical overarching model architecture. |
| Software Dependencies | No | The paper mentions using 'CUDNN optimized version' and states 'We adapt the CUDA kernel as a custom Tensorflow op in our experiments', but it does not provide specific version numbers for TensorFlow, CUDNN, or CUDA. |
| Experiment Setup | Yes | In this section, we describe the task-specific model architectures for each task. Classification Model ... We use 300D GloVe [Pennington et al., 2014] vectors with 600D CoVe [McCann et al., 2017] vectors as pretrained embedding vectors. ... The output of the embedding layer is passed into the RCRN model directly ... Word embeddings are not updated during training. Given the hidden output states of the 200d dimensional RCRN cell, we take the concatenation of the max, mean and min pooling of all hidden states to form the final feature vector. This feature vector is passed into a single dense layer with ReLU activations of 200d dimensions. The output of this layer is then passed into a softmax layer for classification. This model optimizes the cross entropy loss. We train this model using Adam [Kingma and Ba, 2014] and learning rate is tuned amongst {0.001, 0.0003, 0.0004}. Entailment Model ... two layer highway network [Srivastava et al., 2015] of 300 hidden dimensions ... We train this model using Adam and learning rate is tuned amongst {0.001, 0.0003, 0.0004}. Ranking Model ... The dimensionality is set to 200. The similarity scoring function is the cosine similarity and the objective function is the pairwise hinge loss with a margin of 0.1. We use negative sampling of n = 6 to train our model. We train our model using Adadelta [Zeiler, 2012] with a learning rate of 0.2. Reading Comprehension Model ... The dimensionality of the encoder is set to 75. We train both models using Adam with a learning rate of 0.001. (Minimal sketches of the classification head and the ranking loss, as described here, appear after this table.) |
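
The paper specifies RCRN only through equations (1-16), with no pseudocode block. As a loose illustration of the core idea (controller RNNs, rather than input-conditioned gates, drive the forget and output gates of the main "listener" cell), here is a minimal NumPy sketch. The gate wiring, function names, and shapes are our simplifying assumptions, not the authors' exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rcrn_listener(x, f_ctrl, o_ctrl, W_z, b_z):
    """Sketch of a controller-gated recurrence (our simplification,
    not the paper's exact equations 1-16).

    x:      (T, d_in) input sequence
    f_ctrl: (T, d)    hidden states of a 'forget' controller RNN
    o_ctrl: (T, d)    hidden states of an 'output' controller RNN
    W_z:    (d_in, d) input projection; b_z: (d,)
    """
    T, d = f_ctrl.shape
    c = np.zeros(d)                        # listener cell state
    h = np.zeros((T, d))
    for t in range(T):
        z = np.tanh(x[t] @ W_z + b_z)      # candidate state from the input
        f = sigmoid(f_ctrl[t])             # forget gate set by the controller
        o = sigmoid(o_ctrl[t])             # output gate set by the controller
        c = f * c + (1.0 - f) * z          # controller-gated cell update
        h[t] = o * np.tanh(c)              # controller-gated output
    return h
```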
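
The classification head quoted in the Experiment Setup row (concatenated max/mean/min pooling over all RCRN hidden states, a 200-dimensional ReLU dense layer, then softmax) is straightforward to reconstruct. A minimal sketch, assuming `h` holds the hidden states and the weight shapes noted below:

```python
import numpy as np

def classification_head(h, W1, b1, W2, b2):
    """h: (T, d) RCRN hidden states; W1: (3*d, 200); W2: (200, n_classes).
    Pool-concat -> ReLU dense -> softmax, per the setup described in the paper."""
    feats = np.concatenate([h.max(axis=0), h.mean(axis=0), h.min(axis=0)])
    hidden = np.maximum(0.0, feats @ W1 + b1)   # single ReLU dense layer
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()
```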
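
The ranking objective is also stated precisely (cosine similarity scoring, pairwise hinge loss with margin 0.1, n = 6 negative samples). The hinge form below is the standard pairwise ranking loss; treating it as the paper's exact formula is our assumption, since the paper does not print it.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def pairwise_hinge_loss(q, pos, negs, margin=0.1):
    """q: query vector; pos: positive answer vector; negs: the n = 6 sampled
    negative answer vectors. Standard pairwise hinge over cosine scores."""
    s_pos = cosine(q, pos)
    return sum(max(0.0, margin - s_pos + cosine(q, n)) for n in negs)
```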