Densely Connected Attention Propagation for Reading Comprehension

Authors: Yi Tay, Anh Tuan Luu, Siu Cheung Hui, Jian Su

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on four challenging RC benchmarks. Our proposed approach achieves state-of-the-art results on all four, outperforming existing baselines by up to 2.6%-14.2% in absolute F1 score.
Researcher Affiliation | Collaboration | 1,3Nanyang Technological University, Singapore; 2,4Institute for Infocomm Research, Singapore
Pseudocode | No | The paper describes the methodology using text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about the release or availability of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on four challenging QA datasets which are described as follows: NewsQA [Trischler et al., 2016], Quasar-T [Dhingra et al., 2017], SearchQA [Dunn et al., 2017] and NarrativeQA [Kočiský et al., 2017]. As compared to the popular SQuAD dataset [Rajpurkar et al., 2016]...
Dataset Splits | Yes | We conduct experiments on four challenging QA datasets which are described as follows: NewsQA [Trischler et al., 2016], Quasar-T [Dhingra et al., 2017], SearchQA [Dunn et al., 2017] and NarrativeQA [Kočiský et al., 2017]. ... The evaluation metrics are the EM (exact match) and F1 score. Note that for all datasets, we compare all models solely on the RC task. ... Finally, to ensure that our model is not a failing case of SQuAD, and as requested by reviewers, we also include development set scores of our model on SQuAD. (See Tables 1-6, which show 'Dev' and 'Test' columns for all datasets.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions the use of TensorFlow and CUDNN, implying GPU usage without specifying the hardware.
Software Dependencies | No | The paper mentions implementation in TensorFlow [Abadi et al., 2015] and use of the CUDNN implementation for the RNN encoder, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Our model is implemented in TensorFlow [Abadi et al., 2015]. The sequence lengths are capped at 800/700/1500/1100 for NewsQA, SearchQA, Quasar-T and NarrativeQA respectively. We use Adadelta [Zeiler, 2012] with α = 0.5 for NewsQA, and Adam [Kingma and Ba, 2014] with α = 0.001 for SearchQA, Quasar-T and NarrativeQA. The choice of the RNN encoder is tuned between GRU and LSTM cells and the hidden size is tuned amongst {32, 50, 64, 75}. We use the CUDNN implementation of the RNN encoder. Batch size is tuned amongst {16, 32, 64}. Dropout rate is tuned amongst {0.1, 0.2, 0.3} and applied to all RNN and fully-connected layers. We apply variational dropout [Gal and Ghahramani, 2016] in-between RNN layers. We initialize the word embeddings with 300D GloVe embeddings [Pennington et al., 2014], which are fixed during training. The size of the character embeddings is set to 8 and the character RNN is set to the same as the word-level RNN encoders. The maximum characters per word is set to 16. The number of layers in DECAENC is set to 3 and the number of factors in the factorization kernel is set to 64. We use a learning rate decay factor of 2 and a patience of 3 epochs whenever the EM (or ROUGE-L) score on the development set does not increase.
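For readers attempting reproduction, the quoted setup can be condensed into a configuration sketch. The Python below is a hypothetical reconstruction, not the authors' code (none was released); the names CONFIG and decay_learning_rate are illustrative, and only values stated in the quote above are encoded.

# Hypothetical reconstruction of the reported experiment setup; names are illustrative.
CONFIG = {
    # Maximum sequence length (tokens) per dataset.
    "max_seq_len": {"NewsQA": 800, "SearchQA": 700, "Quasar-T": 1500, "NarrativeQA": 1100},
    # Optimizer and initial learning rate per dataset.
    "optimizer": {
        "NewsQA": ("adadelta", 0.5),
        "SearchQA": ("adam", 0.001),
        "Quasar-T": ("adam", 0.001),
        "NarrativeQA": ("adam", 0.001),
    },
    # Hyperparameters tuned over these grids.
    "rnn_cell": ("gru", "lstm"),
    "hidden_size": (32, 50, 64, 75),
    "batch_size": (16, 32, 64),
    "dropout": (0.1, 0.2, 0.3),  # applied to all RNN and fully-connected layers
    # Fixed settings.
    "word_embeddings": {"dim": 300, "init": "glove", "trainable": False},
    "char_embeddings": {"dim": 8, "max_chars_per_word": 16},
    "decaenc_layers": 3,
    "factorization_factors": 64,
    "lr_decay_factor": 2,
    "patience_epochs": 3,
}

def decay_learning_rate(lr, dev_scores, patience=3, factor=2):
    """Divide the learning rate by `factor` when the dev metric (EM or
    ROUGE-L) has not improved over the previous best for `patience` epochs."""
    if len(dev_scores) > patience and max(dev_scores[-patience:]) <= max(dev_scores[:-patience]):
        return lr / factor
    return lr

The decay rule above is one plausible reading of "decay factor of 2 with patience of 3"; the paper does not spell out the exact schedule.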