QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

Authors: Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, Quoc V. Le

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments on the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models. ... On the SQuAD dataset, our single model, trained with augmented data, achieves 84.6 F1 score on the test set, which is significantly better than the best published F1 score of 81.8.
Researcher Affiliation | Collaboration | Adams Wei Yu¹, David Dohan², Minh-Thang Luong² ({weiyu}@cs.cmu.edu, {ddohan,thangluong}@google.com); ¹Carnegie Mellon University, ²Google Brain. Rui Zhao, Kai Chen, Mohammad Norouzi, Quoc V. Le: Google Brain.
Pseudocode | No | The paper describes the model and processes in narrative text and diagrams, but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using a 'publicly available codebase' by Luong et al. (2017) for NMT and states 'TensorFlow implementation: https://www.tensorflow.org/', but that link refers to the TensorFlow library itself, not to code for QANet. No explicit statement or link to the authors' own open-source implementation of QANet is provided.
Open Datasets | Yes | We consider the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) for machine reading comprehension. ... We also conduct similar studies on TriviaQA (Joshi et al., 2017), another Q&A dataset, to show that the effectiveness and efficiency of our model are general.
Dataset Splits | Yes | SQuAD contains 107.7K query-answer pairs, with 87.5K for training, 10.1K for validation, and another 10.1K for testing. ... The Wikipedia sub-dataset contains around 92K training and 11K development examples.
Hardware Specification | Yes | Finally, we implement our model in Python using Tensorflow (Abadi et al., 2016) and carry out our experiments on an NVIDIA p100 GPU.
Software Dependencies | No | The paper states 'implemented our model in Python using Tensorflow (Abadi et al., 2016)' and 'We use the NLTK tokenizer to preprocess the data.', but does not provide specific version numbers for Python, TensorFlow, or NLTK.
Experiment Setup | Yes | The hidden size and the convolution filter number are all 128, the batch size is 32, training steps are 150K for original data, 250K for data augmentation × 2, and 340K for data augmentation × 3. The numbers of convolution layers in the embedding and modeling encoder are 4 and 2, kernel sizes are 7 and 5, and the block numbers for the encoders are 1 and 7, respectively. We use the ADAM optimizer (Kingma & Ba, 2014) with β1 = 0.8, β2 = 0.999, ϵ = 10⁻⁷. We use a learning rate warm-up scheme with an inverse exponential increase from 0.0 to 0.001 in the first 1000 steps, and then maintain a constant learning rate for the remainder of training.
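
The reported optimizer and warm-up settings are concrete enough to sketch in code. The snippet below is a minimal Python/TensorFlow sketch of that configuration, not the authors' implementation: the paper does not spell out the exact functional form of the "inverse exponential increase", so the log-based ramp and the helper name `warmup_lr` are assumptions, and the `tf.keras` optimizer API is a modern stand-in for the TensorFlow release the authors used.

```python
import math
import tensorflow as tf  # assumes TF 2.x; the paper used an earlier TensorFlow release


def warmup_lr(step, target_lr=0.001, warmup_steps=1000):
    """Ramp the learning rate from 0.0 to target_lr over the first
    warmup_steps updates, then hold it constant.

    The log-based ramp is one common reading of "inverse exponential
    increase"; the paper does not specify the exact formula.
    """
    if step < warmup_steps:
        return target_lr * math.log(step + 1) / math.log(warmup_steps + 1)
    return target_lr


# ADAM with the reported hyperparameters: beta1 = 0.8, beta2 = 0.999, eps = 1e-7.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=warmup_lr(0),  # updated each step, e.g. optimizer.learning_rate.assign(warmup_lr(step))
    beta_1=0.8,
    beta_2=0.999,
    epsilon=1e-7,
)
```

In a custom training loop this schedule would be applied by reassigning the optimizer's learning rate with `warmup_lr(step)` before each gradient update; after 1000 steps it simply returns the constant 0.001.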