Query-Reduction Networks for Question Answering
Authors: Minjoon Seo, Sewon Min, Ali Farhadi, Hannaneh Hajishirzi
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that QRN produces state-of-the-art results on the bAbI QA and dialog tasks, and on a real goal-oriented dialog dataset. |
| Researcher Affiliation | Collaboration | University of Washington, Seoul National University, Allen Institute for Artificial Intelligence {minjoon, ali, hannaneh}@cs.washington.edu, shmsw25@snu.ac.kr |
| Pseudocode | No | The paper describes the model using mathematical equations and figures, but it does not include a dedicated "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | Code is publicly available at: seominjoon.github.io/qrn/ |
| Open Datasets | Yes | bAbI story-based QA dataset (Weston et al., 2016), bAbI dialog dataset (Bordes and Weston, 2016), DSTC2 (Task 6) dialog dataset (Henderson et al., 2014) |
| Dataset Splits | Yes | We withhold 10% of the training for development. |
| Hardware Specification | Yes | We implement QRN with and without parallelization in TensorFlow (Abadi et al., 2016) on a single Titan X GPU to quantify the computational gain of the parallelization. |
| Software Dependencies | No | The paper mentions the use of "TensorFlow (Abadi et al., 2016)" but does not specify a version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | We use a hidden state size of 50 by default. Batch sizes of 32 for bAbI story-based QA 1k, bAbI dialog, and DSTC2 dialog, and 128 for bAbI QA 10k are used. The weights in the input and output modules are initialized with zero mean and a standard deviation of 1/√d. Weights in the QRN unit are initialized using the technique of Glorot and Bengio (2010) and are tied across the layers. A forget bias of 2.5 is used for update gates (no bias for reset gates). L2 weight decay of 0.001 (0.0005 for QA 10k) is applied to all weights. The loss function is the cross entropy between v̂ and the one-hot vector of the true answer. The loss is minimized by stochastic gradient descent for at most 500 epochs, but training is stopped early if the loss on the development data does not decrease for 50 epochs. The learning rate is controlled by AdaGrad (Duchi et al., 2011) with an initial learning rate of 0.5 (0.1 for QA 10k). |
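To make the reported setup concrete, the hyperparameters above can be collected into a small sketch, together with the Glorot-uniform initialization limit and the 50-epoch early-stopping rule the paper describes. This is a hedged illustration, not the authors' code: the dictionary keys, function names, and the uniform form of the Glorot initializer are assumptions for illustration only.

```python
import math

# Hyperparameters as reported in the paper; key names are illustrative.
CONFIG = {
    "hidden_size": 50,          # d = 50 by default
    "batch_size_1k": 32,        # bAbI QA 1k, bAbI dialog, DSTC2 dialog
    "batch_size_10k": 128,      # bAbI QA 10k
    "forget_bias": 2.5,         # update gates only; reset gates have no bias
    "l2_decay": 0.001,          # 0.0005 for QA 10k
    "initial_lr": 0.5,          # AdaGrad initial learning rate (0.1 for QA 10k)
    "max_epochs": 500,
    "patience": 50,             # epochs without dev-loss improvement before stopping
}


def glorot_limit(fan_in: int, fan_out: int) -> float:
    """Uniform-init limit from Glorot & Bengio (2010): sqrt(6 / (fan_in + fan_out)).

    (The paper does not say whether the uniform or normal variant was used;
    the uniform form is assumed here.)
    """
    return math.sqrt(6.0 / (fan_in + fan_out))


def should_stop(dev_losses: list[float], patience: int = CONFIG["patience"]) -> bool:
    """Early stopping: stop once the dev loss has not improved for `patience` epochs."""
    if len(dev_losses) <= patience:
        return False
    best_before = min(dev_losses[:-patience])
    return min(dev_losses[-patience:]) >= best_before
```

For example, the QRN-unit weight matrices (square, d = 50) would be drawn from U(−r, r) with r = `glorot_limit(50, 50)` ≈ 0.245, and `should_stop` returns True only after 50 consecutive epochs without a new best development loss.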