SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Authors: Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on VCR and other tasks show significant performance boost compared with the state-of-the-art methods, and prove the efficacy of each proposed component. (Sec. 4, Experiments) In this section, we analyze different components of our framework and compare the performance with the SOTA methods. (Sec. 4.1, Ablation Study) We show the effectiveness of the proposed methods on the validation set of VCR. In Tab. 1, we show the experimental results of the three proposed components: the multi-hop graph Transformer (Hop Trans), scene-graph-aware pretraining (Pretrain-V), and semantically relevant scene graphs generated by Text-VSPNet trained with the proposed strategy (Scene Graph+). |
| Researcher Affiliation | Academia | Columbia University; University of California, Los Angeles |
| Pseudocode | No | The paper includes mathematical equations (1-6) but no explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | Extensive experiments on VCR and other tasks show significant performance boost compared with the state-of-the-art methods... Visual Commonsense Reasoning (Zellers et al. 2019)... VSPNet was originally trained on Visual Genome (VG) (Krishna et al. 2017)... experiments on GQA and SNLI-VE datasets in Tab. 3. It is important to note that we focus on validating the generalized advantage of our method across different datasets... The domain of GQA is very close to Visual Genome, where (Zellers et al. 2018) is trained. |
| Dataset Splits | Yes | We show the effectiveness of the proposed methods on the validation set of VCR. |
| Hardware Specification | No | The paper mentions replacing an object detector with a stronger one (Anderson et al. 2018) but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as Python versions, specific libraries, or frameworks. |
| Experiment Setup | No | The paper describes the model architecture and training strategies but does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other detailed experimental setup configurations in the main text. |