Re-Attention for Visual Question Answering
Authors: Wenya Guo, Ying Zhang, Xiaoping Wu, Jufeng Yang, Xiangrui Cai, Xiaojie Yuan
AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the benchmark dataset demonstrate the proposed method performs favorably against the state-of-the-art approaches. |
| Researcher Affiliation | Academia | ¹College of Computer Science, Nankai University, Tianjin, China; ²College of Cyber Science, Nankai University, Tianjin, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to its source code, nor does it explicitly state that the code will be made available. |
| Open Datasets | Yes | In our experiments, we use the VQA v2 dataset (Goyal et al. 2017) to evaluate the performance of the proposed method. |
| Dataset Splits | Yes | The dataset is typically split into train set, validation set, and test set, which contain 82k, 40k, and 81k images with 443k, 214k, and 447k questions, respectively. |
| Hardware Specification | Yes | All of our approaches are trained on an NVIDIA GTX 1080ti with 11GB on-board memory. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'Adam' but does not specify version numbers for PyTorch or other key software dependencies. |
| Experiment Setup | Yes | The number of objects in the image and the number of words in the question are padded to 100 and 14, respectively. For hyper-parameters such as the hidden size of the LSTM, d_q, the experimental results are stable as d_q changes; it is therefore set to 512 following (Tang et al. 2019), after comprehensively considering the trade-off between model complexity and performance. The input size of the LSTM is set to 300 in the same way. The dimension of the object features, d_v, is 2048. The dimension of q̂ and v̂, i.e., d_c, is 512. The size of the answer set is 3,129 following the strategy in (Teney et al. 2018). All the models are trained with batch size 64. Our framework is implemented using PyTorch and trained with Adam (Kingma and Ba 2015). All of our approaches are trained on an NVIDIA GTX 1080ti with 11GB on-board memory. Only the train split is used during model training for the results evaluated on the validation split. For the performance on the test split, part of the samples in the Visual Genome dataset (Krishna et al. 2017) is used as an augmented dataset to facilitate model training, following (Yu et al. 2019). ... As shown in Table 2, the performance varies with λ_r. Both the base+re-att and base+co+re-att (Ours) methods achieve the best performance when λ_r is 0.8. Therefore, we set λ_r to 0.8 in the ablation study and the comparison with state-of-the-art methods. |
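
The Experiment Setup row reports concrete dimensions and training settings (100 padded objects, 14 padded question words, d_q = 512, input size 300, d_v = 2048, d_c = 512, 3,129 answers, batch size 64, Adam). The sketch below is a minimal PyTorch skeleton that arranges those reported values to make the tensor shapes concrete. The module structure, the class name `VQABaseline`, the vocabulary size, and the simple question-guided attention are illustrative assumptions, not the authors' architecture; the paper does not release code.

```python
# Hedged sketch: reported hyperparameters arranged into a minimal PyTorch
# skeleton. Module structure and attention form are assumptions, NOT the
# authors' released implementation.
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self,
                 vocab_size=20000,   # assumption: vocabulary size not reported
                 d_word=300,         # LSTM input size (reported)
                 d_q=512,            # LSTM hidden size (reported)
                 d_v=2048,           # object feature dimension (reported)
                 d_c=512,            # joint embedding dimension (reported)
                 n_answers=3129):    # answer set size (reported)
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_word, padding_idx=0)
        self.lstm = nn.LSTM(d_word, d_q, batch_first=True)
        self.q_proj = nn.Linear(d_q, d_c)
        self.v_proj = nn.Linear(d_v, d_c)
        self.att = nn.Linear(d_c, 1)  # simple question-guided attention (assumption)
        self.classifier = nn.Linear(d_c, n_answers)

    def forward(self, questions, objects):
        # questions: (B, 14) word indices, questions padded to 14 tokens (reported)
        # objects:   (B, 100, 2048) features, padded to 100 objects (reported)
        _, (h, _) = self.lstm(self.embed(questions))
        q_hat = self.q_proj(h[-1])                 # (B, d_c) question embedding
        v = self.v_proj(objects)                   # (B, 100, d_c) object embeddings
        scores = self.att(v * q_hat.unsqueeze(1))  # (B, 100, 1) attention logits
        weights = torch.softmax(scores, dim=1)
        v_hat = (weights * v).sum(dim=1)           # (B, d_c) attended visual feature
        return self.classifier(q_hat * v_hat)      # (B, 3129) answer logits

model = VQABaseline()
optimizer = torch.optim.Adam(model.parameters())  # Adam optimizer (reported)
# Batch size 64 (reported); dummy inputs just to check shapes.
logits = model(torch.zeros(64, 14, dtype=torch.long), torch.randn(64, 100, 2048))
print(logits.shape)  # torch.Size([64, 3129])
```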