Modality-Balanced Models for Visual Dialogue
Authors: Hyounghun Kim, Hao Tan, Mohit Bansal
AAAI 2020, pp. 8091-8098
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our models achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and high balance across metrics), and substantially outperform the winner of the Visual Dialog challenge 2018 on most metrics. We first conduct a manual investigation on the Visual Dialog dataset (VisDial) to figure out how many questions can be answered only with images and how many of them need conversation history to be answered. |
| Researcher Affiliation | Academia | Hyounghun Kim, Hao Tan, Mohit Bansal Department of Computer Science University of North Carolina at Chapel Hill {hyounghk, airsplay, mbansal}@cs.unc.edu |
| Pseudocode | No | No pseudocode or algorithm blocks are provided. The model architecture and calculations are described using mathematical formulas and text. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository for the methodology described in this paper is provided. |
| Open Datasets | Yes | We use the VisDial v1.0 (Das et al. 2017) dataset to train our models |
| Dataset Splits | Yes | The whole dataset is split into 123,287/2,000/8,000 images for train/validation/test, respectively. |
| Hardware Specification | No | No specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments are provided. |
| Software Dependencies | No | The paper mentions using Adam as an optimizer, LSTM-RNN, Faster R-CNN, and MFB, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | In our models, the size of word vectors is 300, the dimension of the visual features is 2048, and the hidden size of the LSTM units used as encoders for questions, context history, and candidate answers is 512. We set the initial learning rate to 0.001, decrease it by 0.0001 per epoch until the 8th epoch, and decay it by 0.5 from the 9th epoch onward. For round dropout, we set the maximum number of history features to be dropped to 3, and we tune the p value to 0.25 for the instance dropout in the consensus dropout fusion module. Cross-entropy is used to calculate the loss. |
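
The reported setup translates directly into a training configuration. Below is a minimal sketch, assuming PyTorch; the `CONFIG` dict, the `lr_factor` helper, and the stand-in `model` are illustrative names of my own, and the exact reading of the per-epoch learning-rate schedule is one interpretation of the paper's description, not the authors' released code.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the paper's experiment setup.
CONFIG = {
    "word_vec_dim": 300,         # size of word vectors
    "visual_feat_dim": 2048,     # dimension of visual features
    "lstm_hidden_size": 512,     # hidden size of question/history/answer encoders
    "base_lr": 1e-3,             # initial learning rate
    "round_dropout_max": 3,      # max number of history features dropped (round dropout)
    "instance_dropout_p": 0.25,  # p for instance dropout in consensus dropout fusion
}

def lr_factor(epoch: int) -> float:
    """Multiplier on the base learning rate for a 0-indexed epoch.

    One reading of the schedule: subtract 1e-4 per epoch until the 8th
    epoch, then halve the rate every epoch from the 9th epoch onward.
    """
    base = CONFIG["base_lr"]
    if epoch < 8:
        lr = base - 1e-4 * epoch
    else:
        lr = (base - 1e-4 * 7) * (0.5 ** (epoch - 7))
    return lr / base

# `model` stands in for the full visual-dialogue network, which is not shown here.
model = nn.LSTM(CONFIG["word_vec_dim"], CONFIG["lstm_hidden_size"], batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=CONFIG["base_lr"])
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
criterion = nn.CrossEntropyLoss()  # cross-entropy over candidate answers
```

In a training loop, `scheduler.step()` would be called once per epoch to apply the piecewise schedule sketched above.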