Query-Reduction Networks for Question Answering
Authors: Minjoon Seo, Sewon Min, Ali Farhadi, Hannaneh Hajishirzi
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that QRN produces state-of-the-art results on the bAbI QA and dialog tasks, and on a real goal-oriented dialog dataset. |
| Researcher Affiliation | Collaboration | University of Washington, Seoul National University, Allen Institute for Artificial Intelligence {minjoon, ali, hannaneh}@cs.washington.edu, shmsw25@snu.ac.kr |
| Pseudocode | No | The paper describes the model using mathematical equations and figures, but it does not include a dedicated "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | Code is publicly available at: seominjoon.github.io/qrn/ |
| Open Datasets | Yes | bAbI story-based QA dataset (Weston et al., 2016), bAbI dialog dataset (Bordes and Weston, 2016), DSTC2 (Task 6) dialog dataset (Henderson et al., 2014) |
| Dataset Splits | Yes | We withhold 10% of the training for development. |
| Hardware Specification | Yes | We implement QRN with and without parallelization in TensorFlow (Abadi et al., 2016) on a single Titan X GPU to quantify the computational gain of the parallelization. |
| Software Dependencies | No | The paper mentions the use of "TensorFlow (Abadi et al., 2016)" but does not specify a version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | We use a hidden state size of 50 by default. Batch sizes of 32 for bAbI story-based QA 1k, bAbI dialog, and DSTC2 dialog, and 128 for bAbI QA 10k are used. The weights in the input and output modules are initialized with zero mean and a standard deviation of 1/√d. Weights in the QRN unit are initialized using the technique of Glorot and Bengio (2010) and are tied across the layers. A forget bias of 2.5 is used for update gates (no bias for reset gates). L2 weight decay of 0.001 (0.0005 for QA 10k) is applied to all weights. The loss function is the cross entropy between v̂ and the one-hot vector of the true answer. The loss is minimized by stochastic gradient descent for at most 500 epochs, but training is stopped early if the loss on the development data does not decrease for 50 epochs. The learning rate is controlled by AdaGrad (Duchi et al., 2011) with an initial learning rate of 0.5 (0.1 for QA 10k). |
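To make the reported setup concrete, the hyperparameters above can be collected into a small sketch, together with the Glorot-uniform initialization limit and the 50-epoch early-stopping rule the paper describes. This is a hedged illustration, not the authors' code: the dictionary keys, function names, and the uniform form of the Glorot initializer are assumptions for illustration only.

```python
import math

# Hyperparameters as reported in the paper; key names are illustrative.
CONFIG = {
    "hidden_size": 50,          # d = 50 by default
    "batch_size_1k": 32,        # bAbI QA 1k, bAbI dialog, DSTC2 dialog
    "batch_size_10k": 128,      # bAbI QA 10k
    "forget_bias": 2.5,         # update gates only; reset gates have no bias
    "l2_decay": 0.001,          # 0.0005 for QA 10k
    "initial_lr": 0.5,          # AdaGrad initial learning rate (0.1 for QA 10k)
    "max_epochs": 500,
    "patience": 50,             # epochs without dev-loss improvement before stopping
}


def glorot_limit(fan_in: int, fan_out: int) -> float:
    """Uniform-init limit from Glorot & Bengio (2010): sqrt(6 / (fan_in + fan_out)).

    (The paper does not say whether the uniform or normal variant was used;
    the uniform form is assumed here.)
    """
    return math.sqrt(6.0 / (fan_in + fan_out))


def should_stop(dev_losses: list[float], patience: int = CONFIG["patience"]) -> bool:
    """Early stopping: stop once the dev loss has not improved for `patience` epochs."""
    if len(dev_losses) <= patience:
        return False
    best_before = min(dev_losses[:-patience])
    return min(dev_losses[-patience:]) >= best_before
```

For example, the QRN-unit weight matrices (square, d = 50) would be drawn from U(−r, r) with r = `glorot_limit(50, 50)` ≈ 0.245, and `should_stop` returns True only after 50 consecutive epochs without a new best development loss.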