Generative Visual Dialogue System via Weighted Likelihood Estimation

Authors: Heming Zhang, Shalini Ghosh, Larry Heck, Stephen Walsh, Junting Zhang, Jie Zhang, C.-C. Jay Kuo

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results on the VisDial benchmark demonstrate the superiority of our proposed algorithm over other state-of-the-art approaches, with an improvement of 5.81% on recall@10.
Researcher Affiliation | Collaboration | 1 University of Southern California, 2 Samsung Research America, 3 Arizona State University
Pseudocode | No | The paper describes methods and equations, but it does not include a distinct pseudocode block or algorithm section.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We evaluate our proposed model on the VisDial dataset [Das et al., 2017]. In VisDial v0.9, on which most previous work has benchmarked, there are in total 83k and 40k dialogues on COCO-train and COCO-val images, respectively.
Dataset Splits | Yes | We follow the methodology in [Lu et al., 2017] and split the data into 82k for train, 1k for val and 40k for test. In the new version, VisDial v1.0, which was used for the Visual Dialog Challenge 2018, train consists of the previous 123k images and corresponding dialogues; 2k and 8k images with dialogues are collected for val and test, respectively.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models or CPU specifications. It mentions "pre-trained CNN models (VGG, ResNet)", which are software models, not hardware.
Software Dependencies | No | The paper mentions software components such as an "LSTM decoder" and the "Adam optimizer" but does not provide specific version numbers for these or for other libraries/frameworks (e.g., a PyTorch or TensorFlow version).
Experiment Setup | Yes | We use 512D word embeddings, which are trained from scratch and shared by the question, dialogue history and decoder LSTMs. We also set all LSTMs to have a single layer with a 512D hidden state for consistency with other works. We use the Adam optimizer with a base learning rate of 4 × 10^-4.
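
The quoted experiment setup fixes only the embedding size, the single-layer 512D LSTMs, and the Adam learning rate; the authors' code is not released (see the Open Source Code row). Below is a minimal sketch of how that configuration could be wired up, assuming PyTorch. The class name, the additive fusion of encoder states, the vocabulary size, and the omission of the image-feature branch and the weighted likelihood loss are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hyper-parameters taken from the "Experiment Setup" row above.
EMBED_DIM = 512    # word-embedding size, trained from scratch
HIDDEN_DIM = 512   # single-layer LSTM hidden-state size
BASE_LR = 4e-4     # Adam base learning rate


class GenerativeDialogueSketch(nn.Module):
    """Rough encoder-decoder skeleton; layer names and fusion are illustrative only."""

    def __init__(self, vocab_size: int):
        super().__init__()
        # One embedding table shared by the question, history, and decoder LSTMs.
        self.embed = nn.Embedding(vocab_size, EMBED_DIM)
        self.question_lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=1, batch_first=True)
        self.history_lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=1, batch_first=True)
        self.decoder_lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=1, batch_first=True)
        self.output = nn.Linear(HIDDEN_DIM, vocab_size)

    def forward(self, question_tokens, history_tokens, answer_tokens):
        _, (q_state, _) = self.question_lstm(self.embed(question_tokens))
        _, (h_state, _) = self.history_lstm(self.embed(history_tokens))
        # Fuse encoder states to initialize the decoder (additive fusion is an assumption;
        # the paper also fuses image features, omitted here).
        init_h = q_state + h_state
        init_c = torch.zeros_like(init_h)
        dec_out, _ = self.decoder_lstm(self.embed(answer_tokens), (init_h, init_c))
        return self.output(dec_out)  # per-token logits for the generated answer


# Hypothetical vocabulary size; the optimizer setting mirrors the quoted setup.
model = GenerativeDialogueSketch(vocab_size=10_000)
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)
```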