RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

Authors: Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on two standard visual relational reasoning benchmarks: HICO and GQA. Beyond the original independent and identically distributed (I.I.D.) training-testing split, we introduce new systematic splits for each dataset to examine the ability of systematic generalization, i.e., recognizing novel object-relation combinations. Our results show that RelViT significantly outperforms previous approaches. On HICO, it improves the best baseline by 16%, 43%, and 7% on the original non-systematic split and the two new systematic splits, respectively, as shown in Figure 2. On GQA, it further closes the gap in overall accuracy between models using visual backbone features only and models using additional bounding box features (obtained from pre-trained object detectors) by 13% and 18% on the two splits. We also show that our method is compatible with various ViT variants and robust to hyperparameters. Finally, our qualitative inspection indicates that RelViT does improve ViTs on learning relational and object-centric representations.
Researcher Affiliation | Collaboration | Xiaojian Ma (1), Weili Nie (2), Zhiding Yu (2), Huaizu Jiang (3), Chaowei Xiao (2,4), Yuke Zhu (2,5), Song-Chun Zhu (1), Anima Anandkumar (2,6); affiliations: 1 UCLA, 2 NVIDIA, 3 Northeastern University, 4 ASU, 5 UT Austin, 6 Caltech.
Pseudocode | Yes | Algorithm 1: RelViT: Concept-guided Vision Transformer
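The paper's Algorithm 1 is not reproduced here. As a rough, hedged illustration of the DINO-style self-distillation components that the hyperparameters in the Experiment Setup row below refer to (student/teacher temperatures, teacher EMA momentum, center momentum), a generic PyTorch sketch might look like the following; it is not the paper's algorithm:

    # Generic DINO-style pieces, NOT the paper's Algorithm 1: a sharpened-teacher
    # cross-entropy loss, an EMA teacher update, and a running output center.
    import torch
    import torch.nn.functional as F

    def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
        # Cross-entropy between the centered, sharpened teacher distribution and the student.
        t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
        s = F.log_softmax(student_logits / tau_s, dim=-1)
        return -(t * s).sum(dim=-1).mean()

    @torch.no_grad()
    def ema_update(teacher, student, m=0.999):
        # Momentum (EMA) update of the teacher weights from the student.
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(m).add_(p_s, alpha=1.0 - m)

    @torch.no_grad()
    def update_center(center, teacher_logits, m=0.9):
        # Running center of teacher outputs, commonly used to prevent collapse.
        return m * center + (1.0 - m) * teacher_logits.mean(dim=0)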
Open Source Code | No | No explicit statement or link for open-source code was found.
Open Datasets | Yes | We evaluate our method on two standard visual relational reasoning benchmarks: HICO (Chao et al., 2015) and GQA (Hudson & Manning, 2019).
Dataset Splits | Yes | We conduct experiments on two challenging visual relational reasoning datasets: HICO (Chao et al., 2015) and GQA (Hudson & Manning, 2019). Besides their original non-systematic split, we introduce the systematic splits of each dataset to evaluate the systematic generalization of our method. The results are reported on the full validation set of GQA. Table 5 (HICO splits; training samples / training HOIs / testing samples / testing HOIs): Original 38118 / 600 / 9658 / 600; Systematic-easy 37820 / 480 / 9658 / 600; Systematic-hard 9903 / 480 / 9658 / 600. Table 6 (GQA splits; training samples / testing samples): Original 943000 / 132062; Systematic 711945 / 32509.
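For quick reference, the split statistics quoted above can be expressed as a small lookup suitable for sanity-checking data loaders; the key names below are ours, only the counts come from Tables 5 and 6:

    # Split statistics copied from the quoted Tables 5 and 6. Key names are not the paper's.
    HICO_SPLITS = {
        "original":        {"train_samples": 38118, "train_hois": 600, "test_samples": 9658, "test_hois": 600},
        "systematic_easy": {"train_samples": 37820, "train_hois": 480, "test_samples": 9658, "test_hois": 600},
        "systematic_hard": {"train_samples": 9903,  "train_hois": 480, "test_samples": 9658, "test_hois": 600},
    }
    GQA_SPLITS = {
        "original":   {"train_samples": 943000, "test_samples": 132062},
        "systematic": {"train_samples": 711945, "test_samples": 32509},
    }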
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) were mentioned for running experiments.
Software Dependencies | No | The paper mentions "We use the python nltk package to process the question," but does not provide a specific version number for it or for other software dependencies.
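The paper only states that nltk is used to process questions, with no version or pipeline details; the concrete steps below (tokenize, POS-tag, and keep nouns as candidate concept words) are therefore an assumption, shown purely for illustration:

    # Hedged example of question processing with nltk; the exact pipeline is not specified
    # in the paper, so these steps are assumptions.
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    question = "What color is the bag the man is holding?"
    tokens = nltk.word_tokenize(question)
    tagged = nltk.pos_tag(tokens)
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    print(nouns)  # typically ['color', 'bag', 'man']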
Experiment Setup | Yes | Table 3 (hyperparameters for RelViT): Optimizer: AdamW with epsilon 1e-1 (HICO) / 1e-5 (GQA). Gradient clipping norm: no clipping (HICO) / 0.5 (GQA). Base learning rate: 1.5e-4 (HICO) / 3e-5 (GQA). Learning rate schedule: scale by 0.1 at milestones [15, 25] (HICO) / [8, 10] (GQA). Batch size: 16 (HICO) / 64 (GQA). Total training epochs: 30 (HICO) / 12 (GQA). Temperature τ in DINO loss: 0.04 for the teacher and 0.1 for the student, no schedule. Momentum m for the teacher: 0.999. Center momentum for center features: 0.9. Sampling method: most-recent (HICO) / uniform (GQA). Queue size |Q|: 10.
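A minimal sketch of how the HICO column of these hyperparameters could be instantiated in PyTorch is shown below; the model is a placeholder module, and none of this is taken from a released implementation:

    # Hedged sketch: wiring up the Table 3 hyperparameters (HICO column) in PyTorch.
    import torch

    model = torch.nn.Linear(768, 600)  # placeholder, only so the optimizer has parameters

    optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, eps=1e-1)  # "AdamW with epsilon 1e-1 (HICO)"
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 25], gamma=0.1)

    BATCH_SIZE, TOTAL_EPOCHS = 16, 30        # HICO values
    TEACHER_TEMP, STUDENT_TEMP = 0.04, 0.1   # DINO-loss temperatures, no schedule
    TEACHER_MOMENTUM, CENTER_MOMENTUM = 0.999, 0.9
    QUEUE_SIZE = 10                          # concept-feature queue |Q|

    for epoch in range(TOTAL_EPOCHS):
        # ... compute the task loss plus the concept-guided auxiliary losses on each batch,
        # call loss.backward(), and step the optimizer; on GQA the paper additionally clips:
        # torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        scheduler.step()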