Bidirectional Attention Flow for Machine Comprehension

Authors: Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/Daily Mail cloze test. Our BIDAF model outperforms all previous approaches on the highly-competitive Stanford Question Answering Dataset (SQuAD) test set leaderboard at the time of submission. With a modification to only the output layer, BIDAF achieves the state-of-the-art results on the CNN/Daily Mail cloze test. We also provide an in-depth ablation study of our model on the SQuAD development set, visualize the intermediate feature spaces in our model, and analyse its performance as compared to a more traditional language model for machine comprehension (Rajpurkar et al., 2016).
Researcher Affiliation | Collaboration | Minjoon Seo (University of Washington), Aniruddha Kembhavi (Allen Institute for Artificial Intelligence), Ali Farhadi (University of Washington, Allen Institute for Artificial Intelligence), Hannaneh Hajishirzi (University of Washington); {minjoon,ali,hannaneh}@cs.washington.edu, {anik}@allenai.org
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks are present in the paper. The methodology is described using text and mathematical equations.
Open Source Code | Yes | Our code and interactive demo are available at: allenai.github.io/bi-att-flow/
Open Datasets | Yes | We evaluate the performance of our comprehension system on both SQuAD and CNN/Daily Mail datasets. ... Dataset. SQuAD is a machine comprehension dataset on a large set of Wikipedia articles, with more than 100,000 questions. The answer to each question is always a span in the context. ... Rajpurkar et al. (2016) released the Stanford Question Answering Dataset (SQuAD) with over 100,000 questions. ... Dataset. In a cloze test, the reader is asked to fill in words that have been removed from a passage, for measuring one's ability to comprehend text. Hermann et al. (2015) have recently compiled a massive Cloze-style comprehension dataset, consisting of 300k/4k/3k and 879k/65k/53k (train/dev/test) examples from CNN and Daily Mail news articles, respectively. (A minimal sketch of a span-style SQuAD record appears after the table.)
Dataset Splits | Yes | The dataset consists of 90k/10k train/dev question-context tuples with a large hidden test set. It is one of the largest available MC datasets with human-written questions and serves as a great test bed for our model.
Hardware Specification | Yes | The training process takes roughly 20 hours on a single Titan X GPU. ... The entire training process takes roughly 60 hours on eight Titan X GPUs.
Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions using the AdaDelta (Zeiler, 2012) optimizer, the PTB Tokenizer, GloVe, and LSTMs without specifying their software versions.
Experiment Setup | Yes | We use 100 1D filters for CNN char embedding, each with a width of 5. The hidden state size (d) of the model is 100. The model has about 2.6 million parameters. We use the AdaDelta (Zeiler, 2012) optimizer, with a minibatch size of 60 and an initial learning rate of 0.5, for 12 epochs. A dropout (Srivastava et al., 2014) rate of 0.2 is used for the CNN, all LSTM layers, and the linear transformation before the softmax for the answers. ... We use a minibatch size of 48 and train for 8 epochs, with early stopping when the accuracy on validation data starts to drop. (These values are collected into a hedged configuration sketch after the table.)
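
To make the span-based answer format quoted in the Open Datasets row concrete, here is a minimal sketch of a single SQuAD-style record. The field names (context, question, answer_text, answer_start) are illustrative assumptions, not the dataset's official JSON schema, which nests answers under articles, paragraphs, and question-answer pairs.

```python
# A minimal, illustrative SQuAD-style record (hypothetical field names, not the
# official SQuAD JSON schema): the answer is always a contiguous span of the context.
record = {
    "context": "BiDAF was evaluated on the Stanford Question Answering Dataset.",
    "question": "On which dataset was BiDAF evaluated?",
    "answer_text": "the Stanford Question Answering Dataset",
    "answer_start": 23,  # character offset of the answer span within the context
}

# Sanity check: the answer text must be recoverable as a slice of the context.
start = record["answer_start"]
end = start + len(record["answer_text"])
assert record["context"][start:end] == record["answer_text"]
print(record["context"][start:end])
```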
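
The hyperparameters quoted in the Experiment Setup row can be gathered into one configuration object, as sketched below. This is a hedged Python illustration only: the class name BidafTrainingConfig and its field names are hypothetical, the paper's released implementation (at the URL above) is in TensorFlow, and only the numeric values come from the paper.

```python
from dataclasses import dataclass


@dataclass
class BidafTrainingConfig:
    """Hypothetical container for the hyperparameters quoted above (SQuAD settings)."""
    char_cnn_filters: int = 100      # 100 1D filters for the character-level CNN embedding
    char_cnn_filter_width: int = 5   # each filter has a width of 5
    hidden_size: int = 100           # hidden state size d of the model
    optimizer: str = "AdaDelta"      # AdaDelta (Zeiler, 2012)
    batch_size: int = 60             # minibatch size for SQuAD
    initial_learning_rate: float = 0.5
    epochs: int = 12
    dropout_rate: float = 0.2        # CNN, all LSTM layers, and the pre-softmax linear layer

    # CNN/Daily Mail cloze-test training differs in these quoted settings:
    cloze_batch_size: int = 48
    cloze_epochs: int = 8            # with early stopping on validation accuracy


print(BidafTrainingConfig())
```

A plain dataclass is used here because the paper reports only these scalar settings; wiring them into an actual training loop would require the released code and its (unversioned) dependencies.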