End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Authors: Mingrui Wu, Jiaxin Gu, Yunhang Shen, Mingbao Lin, Chao Chen, Xiaoshuai Sun

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our EoID outperforms the previous SOTAs under various zero-shot settings.
Researcher Affiliation | Collaboration | 1 MAC Lab, School of Informatics, Xiamen University. 2 Youtu Lab, Tencent. 3 VIS, Baidu Inc. 4 Institute of Artificial Intelligence, Xiamen University. 5 Fujian Engineering Research Center of Trusted Artificial Intelligence Analysis and Application, Xiamen University.
Pseudocode | No | The paper includes figures illustrating the architecture and algorithms, but no explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | Yes | The source code is available at: https://github.com/mrwu-mac/EoID.
Open Datasets | Yes | We perform our experiments on two HOI detection benchmarks: HICO-DET (Chao et al. 2018) and V-COCO (Gupta and Malik 2015).
Dataset Splits | Yes | For the UC scenario, we use the same 5 sets of 120 unseen HOI classes as Bansal et al. (Bansal et al. 2020).
Hardware Specification | Yes | Experiments are conducted on 4 Tesla V100 GPUs, with a batch size of 16.
Software Dependencies | No | The paper mentions using CLIP (Radford et al. 2021) and CDN (Zhang et al. 2021) models, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | The query number N is 64. The loss weights λbbox, λgiou, λc, λis, λa and λclip are set to 2.5, 1, 1, 1, 1.6 and 700, respectively. For simplicity, the decoupling dynamic re-weighting in CDN is not used. For CLIP, we use the public pretrained model, with an input size of 224 × 224, and γ = 100. The cropped union regions are preprocessed by square padding and resizing. We feed prompt-engineered texts to the text encoder of CLIP with the prompt template "a picture of person {verb} {object}". Experiments are conducted on 4 Tesla V100 GPUs, with a batch size of 16.
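
The experiment-setup quote describes the CLIP side of the pipeline: HOI prompts of the form "a picture of person {verb} {object}" are encoded by CLIP's text encoder, while cropped human-object union regions are square-padded and resized to 224 × 224 before being scored with a temperature γ = 100. Below is a minimal sketch of these steps, assuming the public OpenAI CLIP package and Pillow; the ViT-B/32 variant, function names, and calling interface are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the CLIP prompt encoding and union-region preprocessing
# quoted in the paper. "ViT-B/32" and all helper names are assumptions; the
# paper only states a public pretrained CLIP with 224x224 input and gamma = 100.
import torch
import clip
from PIL import Image, ImageOps

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed variant (224x224 input)

def hoi_text_features(verb_object_pairs):
    """Encode each (verb, object) pair with the paper's prompt template."""
    prompts = [f"a picture of person {verb} {obj}" for verb, obj in verb_object_pairs]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def square_pad_union(image, union_box):
    """Crop the human-object union box and pad it to a square before resizing."""
    crop = image.crop(tuple(map(int, union_box)))  # (x1, y1, x2, y2)
    w, h = crop.size
    side = max(w, h)
    pad_w, pad_h = side - w, side - h
    return ImageOps.expand(
        crop, border=(pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
    )

def clip_interaction_scores(image, union_box, text_feats, gamma=100.0):
    """Score the union region against all HOI prompts, scaled by gamma = 100."""
    region = preprocess(square_pad_union(image, union_box)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(region)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return (gamma * img_feat @ text_feats.T).softmax(dim=-1)
```

For example, calling clip_interaction_scores(img, union_box, hoi_text_features([("riding", "bicycle"), ("holding", "cup")])) would return a softmax distribution over those two prompts for the given region, which is the kind of CLIP teacher signal the paper distills into the HOI detector.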