Visual Reference Resolution using Attention Memory for Visual Dialog
Authors: Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by ~16 % points) in situations, where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~2 % points improvement) in the Visual Dialog dataset [1], despite having significantly fewer parameters than the baselines. |
| Researcher Affiliation | Collaboration | Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal; POSTECH and Disney Research; {hsseo, bhhan}@postech.ac.kr; {andreas.lehrmann, lsigal}@disneyresearch.com |
| Pseudocode | No | The paper describes the architecture and processes in detail, including diagrams, but it does not contain a formal pseudocode block or algorithm listing. |
| Open Source Code | No | The paper states that the MNIST Dialog dataset is available at a given URL, but it provides no concrete access information for the model's source code (e.g., a repository link or an explicit statement of code release). |
| Open Datasets | Yes | We create a synthetic dataset, called MNIST Dialog, which is designed for the analysis of models in the task of visual reference resolution with ambiguous expressions. Each image in MNIST Dialog contains a 4×4 grid of MNIST digits and each MNIST digit in the grid has four randomly sampled attributes, i.e., color = {red, blue, green, purple, brown}, bgcolor = {cyan, yellow, white, silver, salmon}, number = {x | 0 ≤ x ≤ 9} and style = {flat, stroke}, as illustrated in Figure 1. Given the generated image from MNIST Dialog, we automatically generate questions and answers about a subset of the digits in the grid that focus on visual reference resolution. There are two types of questions: (i) counting questions and (ii) attribute questions that refer to a single target digit. During question generation, the target digits for a question is selected based on a subset of the previous targets referred to by ambiguous expressions, as shown in Figure 1. For ease of evaluation, we generate a single word answer rather than a sentence for each question and there are a total of 38 possible answers (1/38 chance performance). We generated 30K / 10K / 10K images for training / validation / testing, respectively, and three ten-question dialogs for each image. The dataset is available at http://cvlab.postech.ac.kr/research/attmem (A hedged dataset-generation sketch follows the table.) |
| Dataset Splits | Yes | We generated 30K / 10K / 10K images for training / validation / testing, respectively, and three ten-question dialogs for each image. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. It only mentions using VGG-16 for feature extraction and training with Adam optimizer. |
| Software Dependencies | No | The paper describes general software components and models (e.g., LSTMs, CNNs, Adam optimizer) but does not provide specific version numbers for programming languages, libraries, or frameworks used (e.g., Python 3.x, TensorFlow x.x, PyTorch x.x). |
| Experiment Setup | Yes | The dimensionality of the word embedding and the hidden state in the LSTMs are set to 32 and 64, respectively. All LSTMs are single-layered... The image feature extraction module is formed by stacking four 3×3 convolutional layers with a subsequent 2×2 pooling layer. The first two convolutional layers have 32 channels, while there are 64 channels in the last two. Finally, we use 512 weight candidates to hash the dynamic parameters of the attention combination process. The entire network is trained end-to-end by minimizing the cross entropy of the predicted answer distribution at every step of the dialogs. [...] We train the network using Adam [40] with the initial learning rate of 0.001 and weight decaying factor 0.0001. Note that we do not update the feature extraction network based on VGG-16. (A hedged configuration sketch follows the table.) |
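
The dataset description quoted in the Open Datasets row can be read as a small generation recipe. The sketch below mirrors only the facts stated there (a 4×4 grid of MNIST digits, the four attribute vocabularies, and the 30K/10K/10K split); the question/answer generation logic is omitted, and all function names are assumptions rather than the authors' released generator.

```python
# Hedged sketch of MNIST Dialog-style image metadata generation, based only on the
# dataset description quoted above. Attribute values are taken verbatim from the
# paper; the structure and function names are assumptions.
import random

COLORS    = ["red", "blue", "green", "purple", "brown"]
BGCOLORS  = ["cyan", "yellow", "white", "silver", "salmon"]
NUMBERS   = list(range(10))          # number in {0, ..., 9}
STYLES    = ["flat", "stroke"]
GRID_SIZE = 4                        # 4x4 grid of digits per image


def sample_grid(rng: random.Random):
    """Sample the attribute metadata for one MNIST Dialog-style image."""
    return [
        {
            "row": r,
            "col": c,
            "number": rng.choice(NUMBERS),
            "color": rng.choice(COLORS),
            "bgcolor": rng.choice(BGCOLORS),
            "style": rng.choice(STYLES),
        }
        for r in range(GRID_SIZE)
        for c in range(GRID_SIZE)
    ]


def sample_split(rng: random.Random, n_train=30_000, n_val=10_000, n_test=10_000):
    """Generate image metadata for the 30K/10K/10K split described in the paper."""
    return {
        "train": [sample_grid(rng) for _ in range(n_train)],
        "val":   [sample_grid(rng) for _ in range(n_val)],
        "test":  [sample_grid(rng) for _ in range(n_test)],
    }


if __name__ == "__main__":
    grid = sample_grid(random.Random(0))
    print(grid[0])   # one digit's sampled attributes
```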
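Likewise, the hyperparameters quoted in the Experiment Setup row map onto a compact configuration. The PyTorch sketch below reflects only the stated sizes (32-d word embeddings, 64-d single-layer LSTMs, the 32/32/64/64-channel 3×3 conv stack with 2×2 pooling, and the Adam settings); the attention memory, the dynamic-parameter hashing with 512 weight candidates, and the answer decoder of the actual model are omitted, and the module names, input channel count, and vocabulary size are assumptions.

```python
# Hedged PyTorch sketch of the MNIST Dialog configuration quoted above. It mirrors
# only the stated sizes; it is not the authors' implementation.
import torch
import torch.nn as nn

EMBED_DIM   = 32      # word embedding dimensionality (stated in the paper)
HIDDEN_DIM  = 64      # LSTM hidden state size, single-layer LSTMs (stated)
VOCAB_SIZE  = 128     # placeholder; the vocabulary size is not stated in the quote
NUM_ANSWERS = 38      # 38 possible single-word answers (stated)


class ImageEncoder(nn.Module):
    """Four 3x3 conv layers (32, 32, 64, 64 channels) followed by 2x2 pooling."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),   # 3 input channels assumed (RGB)
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )

    def forward(self, images):
        return self.features(images)


class QuestionEncoder(nn.Module):
    """Single-layer LSTM over 32-d word embeddings with a 64-d hidden state."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=1, batch_first=True)

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]                      # final hidden state as question encoding


# Training setup quoted in the paper: Adam with learning rate 0.001 and a weight
# decay factor of 0.0001 (mapped here onto Adam's weight_decay argument), with
# cross-entropy over the predicted answer distribution at every dialog step.
model_parts = nn.ModuleList([ImageEncoder(), QuestionEncoder()])
optimizer = torch.optim.Adam(model_parts.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
```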