Bidirectional Attention Flow for Machine Comprehension
Authors: Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/Daily Mail cloze test. Our BIDAF model outperforms all previous approaches on the highly-competitive Stanford Question Answering Dataset (SQuAD) test set leaderboard at the time of submission. With a modification to only the output layer, BIDAF achieves the state-of-the-art results on the CNN/Daily Mail cloze test. We also provide an in-depth ablation study of our model on the SQuAD development set, visualize the intermediate feature spaces in our model, and analyse its performance as compared to a more traditional language model for machine comprehension (Rajpurkar et al., 2016). |
| Researcher Affiliation | Collaboration | Minjoon Seo1 Aniruddha Kembhavi2 Ali Farhadi1,2 Hannaneh Hajishirzi1 University of Washington1, Allen Institute for Artificial Intelligence2 {minjoon,ali,hannaneh}@cs.washington.edu, {anik}@allenai.org |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks are present in the paper. The methodology is described using text and mathematical equations. |
| Open Source Code | Yes | Our code and interactive demo are available at: allenai.github.io/bi-att-flow/ |
| Open Datasets | Yes | We evaluate the performance of our comprehension system on both SQuAD and CNN/Daily Mail datasets. ... Dataset. SQuAD is a machine comprehension dataset on a large set of Wikipedia articles, with more than 100,000 questions. The answer to each question is always a span in the context. ... Rajpurkar et al. (2016) released the Stanford Question Answering (SQuAD) dataset with over 100,000 questions. ... Dataset. In a cloze test, the reader is asked to fill in words that have been removed from a passage, for measuring one's ability to comprehend text. Hermann et al. (2015) have recently compiled a massive Cloze-style comprehension dataset, consisting of 300k/4k/3k and 879k/65k/53k (train/dev/test) examples from CNN and Daily Mail news articles, respectively. |
| Dataset Splits | Yes | The dataset consists of 90k/10k train/dev question-context tuples with a large hidden test set. It is one of the largest available MC datasets with human-written questions and serves as a great test bed for our model. |
| Hardware Specification | Yes | The training process takes roughly 20 hours on a single Titan X GPU. ... The entire training process takes roughly 60 hours on eight Titan X GPUs. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions using the 'AdaDelta (Zeiler, 2012) optimizer', 'PTB Tokenizer', 'GloVe', and 'LSTM' without specifying their software versions. |
| Experiment Setup | Yes | We use 100 1D filters for CNN char embedding, each with a width of 5. The hidden state size (d) of the model is 100. The model has about 2.6 million parameters. We use the AdaDelta (Zeiler, 2012) optimizer, with a minibatch size of 60 and an initial learning rate of 0.5, for 12 epochs. A dropout (Srivastava et al., 2014) rate of 0.2 is used for the CNN, all LSTM layers, and the linear transformation before the softmax for the answers. ... We use a minibatch size of 48 and train for 8 epochs, with early stop when the accuracy on validation data starts to drop. |
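The character-level CNN embedding quoted in the setup row (100 1D filters of width 5, max-over-time pooling into the model's d = 100 hidden size) can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' released code: the per-character embedding size (8), the random initialization, and the zero-padding of short words are assumptions made here for a self-contained example; only the filter count and width come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_FILTERS = 100   # from the paper: 100 1D filters
FILTER_WIDTH = 5    # from the paper: filter width 5
CHAR_DIM = 8        # assumption: per-character embedding size

def char_cnn_embed(char_embs, filters):
    """Convolve filters over a word's character embeddings, then max-pool.

    char_embs: (word_len, CHAR_DIM) array of character embeddings.
    filters:   (NUM_FILTERS, FILTER_WIDTH, CHAR_DIM) filter bank.
    Returns a (NUM_FILTERS,) fixed-size word vector.
    """
    word_len = char_embs.shape[0]
    # Zero-pad short words so at least one window fits (assumption).
    if word_len < FILTER_WIDTH:
        pad = np.zeros((FILTER_WIDTH - word_len, CHAR_DIM))
        char_embs = np.vstack([char_embs, pad])
        word_len = FILTER_WIDTH
    n_windows = word_len - FILTER_WIDTH + 1
    conv = np.empty((n_windows, NUM_FILTERS))
    for t in range(n_windows):
        window = char_embs[t:t + FILTER_WIDTH]  # (FILTER_WIDTH, CHAR_DIM)
        # Each filter produces one scalar response per window position.
        conv[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    # Max-over-time pooling collapses the time axis to one vector per word.
    return conv.max(axis=0)

filters = rng.standard_normal((NUM_FILTERS, FILTER_WIDTH, CHAR_DIM)) * 0.1
word = rng.standard_normal((7, CHAR_DIM))  # a 7-character word
vec = char_cnn_embed(word, filters)
print(vec.shape)  # (100,) — one 100-dim embedding per word
```

The resulting 100-dimensional vector matches the hidden state size d = 100 reported in the setup, so the char-CNN output can feed the downstream layers directly.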