Translucent Answer Predictions in Multi-Hop Reading Comprehension
Authors: G P Shrivatsa Bhargav, Michael Glass, Dinesh Garg, Shirish Shevade, Saswati Dana, Dinesh Khandelwal, L Venkata Subramaniam, Alfio Gliozzo
AAAI 2020, pp. 7700-7707
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | TAP offers state-of-the-art performance on the HotpotQA (Yang et al. 2018) dataset, an apt dataset for the multi-hop RCQA task, as it occupies Rank-1 on its leaderboard (https://hotpotqa.github.io/) at the time of submission. and (Section 6, Experiments) HotpotQA is a large scale QA dataset focusing on explainability and multi-hop reasoning. ... We evaluate TAP on the hidden test set for the distractor setting of HotpotQA by submitting our system for evaluation. We also use the publicly available development set to explore the impact of decisions in our architecture. and Table 2: Performance of TAP (ours) in comparison with the next closest and closest published models on the HotpotQA leaderboard. |
| Researcher Affiliation | Collaboration | (1) IBM Research AI, (2) Dept. of CSA, IISc, Bangalore. {mrglass, gliozzo}@us.ibm.com, {bhargavs, shirish}@iisc.ac.in, {garg.dinesh, sadana04, dikhand1, lvsubram}@in.ibm.com |
| Pseudocode | No | The paper provides architectural diagrams and descriptive text for its components, but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The TAP code repository can be found at https://github.com/IBM/translucent-answer-prediction. |
| Open Datasets | Yes | HotpotQA (Yang et al. 2018) dataset, an apt dataset for the multi-hop RCQA task, as it occupies Rank-1 on its leaderboard (https://hotpotqa.github.io/) at the time of submission. and HotpotQA is a large scale QA dataset focusing on explainability and multi-hop reasoning. This dataset comes with human annotated sentence level binary labels indicating which sentences are supporting facts for answering a given question. |
| Dataset Splits | Yes | HotpotQA (Yang et al. 2018) dataset and Table 1 shows some statistics on the training and development sets. and We also use the publicly available development set to explore the impact of decisions in our architecture. (See the loading sketch after this table.) |
| Hardware Specification | Yes | Training LoGIX took approximately 24 hours on 8 P100 GPUs. In the joint setting this training was done for each of the five folds. The Answer Predictor takes under 10 hours to train on 4 P100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch was used to develop TAP' but does not provide a specific version for PyTorch or any other software dependency. |
| Experiment Setup | Yes | We use pre-trained BERT-Large models. In the Global Layer of LoGIX, there are two transformer layers. For both networks we use the ADAM (Kingma and Ba 2015) optimizer with a maximum learning rate of 3 × 10^-5 and a triangular learning schedule, warming up over the first 10% of training instances. Questions are truncated to 35 tokens and passages are truncated to 512 tokens. The total length of the passage set is limited to 2048 tokens, with the longest passages truncated to fit. We trained LoGIX for 4 epochs with a batch size of 8 and the Answer Predictor also for 4 epochs with a batch size of 16. (See the configuration sketch after this table.) |
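
The training and development splits cited in the Dataset Splits row are publicly downloadable. The sketch below is a convenience for reproducers, not part of the paper (which predates this tooling); it assumes the Hugging Face `datasets` hub's `hotpot_qa` configuration and its field schema.

```python
# Hedged sketch: fetch the public HotpotQA distractor-setting splits via
# the Hugging Face `datasets` hub. This is not the authors' pipeline;
# field names follow the hub's `hotpot_qa` schema.
from datasets import load_dataset

hotpot = load_dataset("hotpot_qa", "distractor")
train, dev = hotpot["train"], hotpot["validation"]
print(len(train), len(dev))  # roughly 90k train / 7.4k dev questions

example = train[0]
print(example["question"])
# Sentence-level supporting-fact labels: paragraph titles paired with
# sentence indices within those paragraphs.
print(example["supporting_facts"])
```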
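
The optimization recipe in the Experiment Setup row maps onto standard PyTorch components. Below is a minimal sketch under stated assumptions: the tiny `torch.nn.Linear` model and `steps_per_epoch` are placeholders, and the triangular schedule is approximated with `transformers.get_linear_schedule_with_warmup` (linear warmup, then linear decay), which matches the paper's description but is not confirmed to be the authors' exact implementation.

```python
# Minimal sketch of the reported optimization setup; not the authors'
# code. The tiny Linear model and steps_per_epoch are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS = 4              # both LoGIX and the Answer Predictor train 4 epochs
BATCH_SIZE = 8          # LoGIX batch size (the Answer Predictor uses 16)
PEAK_LR = 3e-5          # maximum learning rate reported in the paper
MAX_QUESTION_TOKENS = 35    # questions truncated to 35 tokens
MAX_PASSAGE_TOKENS = 512    # individual passages truncated to 512 tokens
MAX_TOTAL_TOKENS = 2048     # total passage-set token budget

steps_per_epoch = 1000  # placeholder; depends on dataset size / batch size
total_steps = EPOCHS * steps_per_epoch

model = torch.nn.Linear(8, 2)  # stand-in for the BERT-Large based network
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)

# Triangular schedule: linear warmup over the first 10% of training
# instances, then linear decay to zero.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

for step in range(total_steps):
    # ... forward / backward pass on a batch would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

Swapping the placeholder model for a pre-trained BERT-Large encoder and computing `steps_per_epoch` from the actual dataset size recovers the reported LoGIX configuration; per the paper, the Answer Predictor setup differs only in batch size (16).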