Exploiting the Social-Like Prior in Transformer for Visual Reasoning

Authors: Yudong Han, Yupeng Hu, Xuemeng Song, Haoyu Tang, Mingzhu Xu, Liqiang Nie

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our model outperforms a range of baselines by a noticeable margin when equipped with our social-like prior on five benchmarks across the VQA and REC tasks, and a series of explanatory results sufficiently reveal the social-like behaviors in SA.
Researcher Affiliation Academia 1School of Software, Shandong University 2School of Computer Science and Technology, Shandong University 3School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen) {hanyudong.sdu, sxmustc, nieliqiang}@gmail.com, {huyupeng, tanghao258, xumingzhu}@sdu.edu.cn
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide any concrete access to source code (e.g., repository link, explicit statement of code release) for the methodology described.
Open Datasets Yes Datasets. VQA 2.0 is the most commonly used benchmark dataset for VQA, which is developed based on VQA 1.0. The images stem from Microsoft COCO (Lin et al. 2014). The overall dataset has about 1000K examples, which are split into train, val, and test sets, respectively. CLEVR is a synthetic diagnostic dataset... RefCOCO, RefCOCO+, and RefCOCOg are three commonly used benchmarks for REC.
Dataset Splits Yes VQA 2.0 is the most commonly used benchmark dataset for VQA... The overall dataset has about 1000K examples, which are split into train, val, and test sets, respectively. CLEVR... 70K/15K images and 700K/150K questions in the train/val set... RefCOCO... which is split into train, val, testA, and testB sets.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types) used for running its experiments.
Software Dependencies No The paper mentions software components like 'GloVe embeddings', 'LSTM', 'ResNeXt-152', and 'BERT model' but does not specify their version numbers for reproducibility.
Experiment Setup Yes In the VQA task, the model configurations for VQA 2.0 and CLEVR are similar... The numbers of training epochs for VQA 2.0 and CLEVR are set to 13 and 16, respectively, and a warm-up strategy is adopted in the first three epochs. The learning rate is initialized to 1e-4 and decayed by 0.2 at the 10th, 13th, and 15th epochs. The batch size is set to 64. In the REC task... our model is trained for 90 epochs with an initial learning rate of 1e-4, dropped by a factor of 10 after 60 epochs, except on RefCOCOg, where we train for 60 epochs and drop after 40 epochs; we set the batch size to 16.
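The learning-rate schedule quoted above (warm-up over the first three epochs, base rate 1e-4, step decay by 0.2 at epochs 10, 13, and 15 for VQA; a single factor-of-10 drop for REC) can be sketched as a small Python function. The linear warm-up shape and the function's name and signature are assumptions for illustration; the paper says only that a warm-up strategy is used and does not release code.

```python
def scheduled_lr(epoch, base_lr=1e-4, warmup_epochs=3,
                 milestones=(10, 13, 15), gamma=0.2):
    """Learning rate at a 0-indexed epoch under the paper's reported schedule.

    Linear warm-up over the first `warmup_epochs` epochs (an assumed shape),
    then the base rate multiplied by `gamma` once per milestone passed.
    """
    if epoch < warmup_epochs:
        # Ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    factor = 1.0
    for m in milestones:
        if epoch >= m:
            factor *= gamma
    return base_lr * factor

# VQA 2.0 schedule: 1e-4 after warm-up, 2e-5 from epoch 10, 4e-6 from epoch 13.
print(scheduled_lr(5), scheduled_lr(10), scheduled_lr(13))

# REC schedule (no warm-up mentioned, single drop by 10x after epoch 60):
print(scheduled_lr(70, warmup_epochs=0, milestones=(60,), gamma=0.1))
```

A function of this shape plugs directly into a framework hook such as PyTorch's `torch.optim.lr_scheduler.LambdaLR` (dividing out `base_lr`, since `LambdaLR` expects a multiplicative factor).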