End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Authors: Mingrui Wu, Jiaxin Gu, Yunhang Shen, Mingbao Lin, Chao Chen, Xiaoshuai Sun

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our EoID outperforms the previous SOTAs under various zero-shot settings.
Researcher Affiliation | Collaboration | 1 MAC Lab, School of Informatics, Xiamen University. 2 Youtu Lab, Tencent. 3 VIS, Baidu Inc. 4 Institute of Artificial Intelligence, Xiamen University. 5 Fujian Engineering Research Center of Trusted Artificial Intelligence Analysis and Application, Xiamen University.
Pseudocode | No | The paper includes figures illustrating the architecture and algorithms, but no explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | Yes | The source code is available at: https://github.com/mrwu-mac/EoID.
Open Datasets | Yes | We perform our experiments on two HOI detection benchmarks: HICO-DET (Chao et al. 2018) and V-COCO (Gupta and Malik 2015).
Dataset Splits | Yes | For the UC scenario, we use the same 5 sets of 120 unseen HOI classes as Bansal et al. (Bansal et al. 2020).
Hardware Specification | Yes | Experiments are conducted on 4 Tesla V100 GPUs, with a batch size of 16.
Software Dependencies | No | The paper mentions using CLIP (Radford et al. 2021) and CDN (Zhang et al. 2021) models, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | The query number N is 64. The loss weights λbbox, λgiou, λc, λis, λa and λclip are set to 2.5, 1, 1, 1, 1.6 and 700, respectively. For simplicity, the decoupling dynamic re-weighting in CDN is not used. For CLIP, we use the public pretrained model, with an input size of 224 × 224, and γ = 100. The cropped union regions are preprocessed by square padding and resizing. We feed prompt-engineered texts to the text encoder of CLIP with the prompt template "a picture of person {verb} {object}". Experiments are conducted on 4 Tesla V100 GPUs, with a batch size of 16.
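
The experiment-setup quote describes the CLIP side of the pipeline: HOI prompts of the form "a picture of person {verb} {object}" are encoded by CLIP's text encoder, while cropped human-object union regions are square-padded and resized to 224 × 224 before being scored with a temperature γ = 100. Below is a minimal sketch of these steps, assuming the public OpenAI CLIP package and Pillow; the ViT-B/32 variant, function names, and calling interface are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the CLIP prompt encoding and union-region preprocessing
# quoted in the paper. "ViT-B/32" and all helper names are assumptions; the
# paper only states a public pretrained CLIP with 224x224 input and gamma = 100.
import torch
import clip
from PIL import Image, ImageOps

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed variant (224x224 input)

def hoi_text_features(verb_object_pairs):
    """Encode each (verb, object) pair with the paper's prompt template."""
    prompts = [f"a picture of person {verb} {obj}" for verb, obj in verb_object_pairs]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def square_pad_union(image, union_box):
    """Crop the human-object union box and pad it to a square before resizing."""
    crop = image.crop(tuple(map(int, union_box)))  # (x1, y1, x2, y2)
    w, h = crop.size
    side = max(w, h)
    pad_w, pad_h = side - w, side - h
    return ImageOps.expand(
        crop, border=(pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
    )

def clip_interaction_scores(image, union_box, text_feats, gamma=100.0):
    """Score the union region against all HOI prompts, scaled by gamma = 100."""
    region = preprocess(square_pad_union(image, union_box)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(region)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return (gamma * img_feat @ text_feats.T).softmax(dim=-1)
```

For example, calling clip_interaction_scores(img, union_box, hoi_text_features([("riding", "bicycle"), ("holding", "cup")])) would return a softmax distribution over those two prompts for the given region, which is the kind of CLIP teacher signal the paper distills into the HOI detector.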