Dual Visual Attention Network for Visual Dialog

Authors: Dan Guo, Hui Wang, Meng Wang

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the VisDial v0.9 and v1.0 datasets validate the effectiveness of the proposed approach.
Researcher Affiliation | Academia | Dan Guo, Hui Wang and Meng Wang, School of Computer Science and Information Engineering, Hefei University of Technology; guodan@hfut.edu.cn, wanghui.hfut@gmail.com, eric.mengwang@gmail.com
Pseudocode | No | The paper does not contain structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is open-source or publicly available.
Open Datasets | Yes | We evaluate the proposed model on the VisDial v0.9 and v1.0 [Das et al., 2017] datasets.
Dataset Splits | Yes | VisDial v0.9 contains 83k dialogs on COCO-train images and 40k dialogs on COCO-val images (1.2M QA pairs in total). ... VisDial v1.0 is an updated version of VisDial v0.9, in which VisDial v0.9 is set to be the train split, and the new val and test splits of VisDial v1.0 contain 2k and 8k dialogs collected on COCO-like Flickr images, respectively.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software components such as VGG19, Faster R-CNN, NLTK, GloVe embeddings, LSTM, the Adam optimizer, and Dropout, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | The captions, questions, and answers are truncated to 24/16/8 words for generative models, and 40/20/20 words for discriminative models, respectively. Next, each word is embedded into a 300-dim vector initialized by the GloVe embedding [Pennington et al., 2014]. All the LSTMs in our model are 1-layered with 512 hidden states. The Adam optimizer [Kingma and Ba, 2014] is adopted with an initial learning rate of 4 × 10^-4, multiplied by 0.5 after every 20 epochs. We also apply Dropout [Srivastava et al., 2014] with ratio 0.5 for the LSTMs, attention modules, and the output of the encoder. Finally, generative models are trained with an MLE (maximum likelihood estimation) loss, while discriminative models are trained with a multi-class N-pair loss [Lu et al., 2017a].
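
To make the Experiment Setup row concrete, below is a minimal sketch of the reported training configuration. PyTorch is an assumption (the paper does not name its framework), and the ToyEncoder module, vocabulary size, and training loop are hypothetical placeholders rather than the authors' DVAN architecture; only the hyperparameters (300-dim GloVe-initialized embeddings, 1-layer LSTMs with 512 hidden units, Adam at 4e-4 halved every 20 epochs, dropout 0.5, and an MLE loss for generative decoding) follow the quoted setup.

```python
# Minimal sketch of the training configuration quoted above. PyTorch and all
# module/variable names here are assumptions for illustration; this is not the
# authors' DVAN model.
import torch
import torch.nn as nn

VOCAB_SIZE = 10000   # hypothetical vocabulary size (not reported in the paper)
EMBED_DIM = 300      # word embeddings are 300-dim, initialized from GloVe
HIDDEN_DIM = 512     # all LSTMs are 1-layered with 512 hidden states
DROPOUT = 0.5        # dropout ratio 0.5 for LSTMs, attention modules, encoder output


class ToyEncoder(nn.Module):
    """Placeholder text encoder that mirrors the reported hyperparameters."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # would be loaded from GloVe vectors
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(DROPOUT)

    def forward(self, tokens):
        emb = self.embed(tokens)            # (batch, seq_len, 300)
        _, (h, _) = self.lstm(emb)          # final hidden state as the sentence encoding
        return self.dropout(h.squeeze(0))   # (batch, 512)


model = ToyEncoder()

# Adam with initial learning rate 4e-4, multiplied by 0.5 after every 20 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

# Generative decoders are trained with an MLE (token-level cross-entropy) loss;
# the discriminative multi-class N-pair loss over candidate answers is omitted here.
mle_loss = nn.CrossEntropyLoss()

for epoch in range(40):
    # ... one pass over the VisDial training split would go here ...
    scheduler.step()  # decays the learning rate on the reported 20-epoch schedule
```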