SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Authors: Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

AAAI 2022, pp. 5914-5922 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on VCR and other tasks show significant performance boost compared with the state-of-the-art methods, and prove the efficacy of each proposed component. [Section 4, Experiments] In this section, we analyze different components of our framework and compare the performance with the SOTA methods. [Section 4.1, Ablation Study] We show the effectiveness of the proposed methods on the validation set of VCR. In Tab. 1, we show the experimental results of the three proposed components: the multi-hop graph Transformer (Hop Trans), scene-graph-aware pretraining (Pretrain-V), and semantically relevant scene graphs generated by Text-VSPNet trained with the proposed strategy (Scene Graph+).
Researcher Affiliation | Academia | 1 Columbia University; 2 University of California, Los Angeles
Pseudocode | No | The paper includes mathematical equations (1-6) but no explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | Extensive experiments on VCR and other tasks show significant performance boost compared with the state-of-the-art methods... Visual Commonsense Reasoning (Zellers et al. 2019)... VSPNet was originally trained on Visual Genome (VG) (Krishna et al. 2017)... experiments on GQA and SNLI-VE datasets in Tab. 3. It is important to note that we focus on validating the generalized advantage of our method across different datasets... The domain of GQA is very close to Visual Genome, where (Zellers et al. 2018) is trained.
Dataset Splits | Yes | We show the effectiveness of the proposed methods on the validation set of VCR.
Hardware Specification | No | The paper mentions replacing an object detector with a stronger one (Anderson et al. 2018) but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as Python versions, specific libraries, or frameworks.
Experiment Setup | No | The paper describes the model architecture and training strategies but does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other detailed experimental setup configurations in the main text.