Densely Connected Attention Propagation for Reading Comprehension

Authors: Yi Tay, Anh Tuan Luu, Siu Cheung Hui, Jian Su

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on four challenging RC benchmarks. Our proposed approach achieves state-of-the-art results on all four, outperforming existing baselines by up to 2.6%-14.2% in absolute F1 score.
Researcher Affiliation | Collaboration | 1,3Nanyang Technological University, Singapore; 2,4Institute for Infocomm Research, Singapore
Pseudocode | No | The paper describes the methodology using text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about the release or availability of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on four challenging QA datasets which are described as follows: NewsQA [Trischler et al., 2016], Quasar-T [Dhingra et al., 2017], SearchQA [Dunn et al., 2017] and NarrativeQA [Kočiský et al., 2017]. As compared to the popular SQuAD dataset [Rajpurkar et al., 2016]...
Dataset Splits | Yes | We conduct experiments on four challenging QA datasets which are described as follows: NewsQA [Trischler et al., 2016], Quasar-T [Dhingra et al., 2017], SearchQA [Dunn et al., 2017] and NarrativeQA [Kočiský et al., 2017]. ... The evaluation metrics are the EM (exact match) and F1 score. Note that for all datasets, we compare all models solely on the RC task. ... Finally, to ensure that our model is not a failing case of SQuAD, and as requested by reviewers, we also include development set scores of our model on SQuAD. (See Tables 1-6, which show 'Dev' and 'Test' columns for all datasets.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions the use of TensorFlow and CUDNN, implying GPU usage without specifying the hardware.
Software Dependencies | No | The paper mentions implementation in TensorFlow [Abadi et al., 2015] and use of the CUDNN implementation for the RNN encoder, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Our model is implemented in TensorFlow [Abadi et al., 2015]. The sequence lengths are capped at 800/700/1500/1100 for NewsQA, SearchQA, Quasar-T and NarrativeQA respectively. We use Adadelta [Zeiler, 2012] with α = 0.5 for NewsQA, and Adam [Kingma and Ba, 2014] with α = 0.001 for SearchQA, Quasar-T and NarrativeQA. The choice of the RNN encoder is tuned between GRU and LSTM cells and the hidden size is tuned amongst {32, 50, 64, 75}. We use the CUDNN implementation of the RNN encoder. Batch size is tuned amongst {16, 32, 64}. Dropout rate is tuned amongst {0.1, 0.2, 0.3} and applied to all RNN and fully-connected layers. We apply variational dropout [Gal and Ghahramani, 2016] in-between RNN layers. We initialize the word embeddings with 300D GloVe embeddings [Pennington et al., 2014], which are fixed during training. The size of the character embeddings is set to 8 and the character RNN is set to the same as the word-level RNN encoders. The maximum characters per word is set to 16. The number of layers in DECAENC is set to 3 and the number of factors in the factorization kernel is set to 64. We use a learning rate decay factor of 2 and a patience of 3 epochs whenever the EM (or ROUGE-L) score on the development set does not increase.
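For readers attempting reproduction, the quoted setup can be condensed into a configuration sketch. The Python below is a hypothetical reconstruction, not the authors' code (none was released); the names CONFIG and decay_learning_rate are illustrative, and only values stated in the quote above are encoded.

# Hypothetical reconstruction of the reported experiment setup; names are illustrative.
CONFIG = {
    # Maximum sequence length (tokens) per dataset.
    "max_seq_len": {"NewsQA": 800, "SearchQA": 700, "Quasar-T": 1500, "NarrativeQA": 1100},
    # Optimizer and initial learning rate per dataset.
    "optimizer": {
        "NewsQA": ("adadelta", 0.5),
        "SearchQA": ("adam", 0.001),
        "Quasar-T": ("adam", 0.001),
        "NarrativeQA": ("adam", 0.001),
    },
    # Hyperparameters tuned over these grids.
    "rnn_cell": ("gru", "lstm"),
    "hidden_size": (32, 50, 64, 75),
    "batch_size": (16, 32, 64),
    "dropout": (0.1, 0.2, 0.3),  # applied to all RNN and fully-connected layers
    # Fixed settings.
    "word_embeddings": {"dim": 300, "init": "glove", "trainable": False},
    "char_embeddings": {"dim": 8, "max_chars_per_word": 16},
    "decaenc_layers": 3,
    "factorization_factors": 64,
    "lr_decay_factor": 2,
    "patience_epochs": 3,
}

def decay_learning_rate(lr, dev_scores, patience=3, factor=2):
    """Divide the learning rate by `factor` when the dev metric (EM or
    ROUGE-L) has not improved over the previous best for `patience` epochs."""
    if len(dev_scores) > patience and max(dev_scores[-patience:]) <= max(dev_scores[:-patience]):
        return lr / factor
    return lr

The decay rule above is one plausible reading of "decay factor of 2 with patience of 3"; the paper does not spell out the exact schedule.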