End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation
Authors: Mingrui Wu, Jiaxin Gu, Yunhang Shen, Mingbao Lin, Chao Chen, Xiaoshuai Sun
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our EoID outperforms the previous SOTAs under various zero-shot settings. |
| Researcher Affiliation | Collaboration | 1MAC Lab, School of Informatics, Xiamen University. 2Youtu Lab, Tencent. 3VIS, Baidu Inc. 4Institute of Artificial Intelligence, Xiamen University. 5Fujian Engineering Research Center of Trusted Artificial Intelligence Analysis and Application, Xiamen University. |
| Pseudocode | No | The paper includes figures illustrating the architecture and algorithms, but no explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps. |
| Open Source Code | Yes | The source code is available at: https://github.com/mrwu-mac/EoID. |
| Open Datasets | Yes | We perform our experiments on two HOI detection benchmarks: HICO-DET (Chao et al. 2018) and V-COCO (Gupta and Malik 2015). |
| Dataset Splits | Yes | For the UC scenario, we use the same 5 sets of 120 unseen HOI classes as Bansal et al. (Bansal et al. 2020). |
| Hardware Specification | Yes | Experiments are conducted on 4 Tesla V100 GPUs, with a batch size of 16. |
| Software Dependencies | No | The paper mentions using CLIP (Radford et al. 2021) and CDN (Zhang et al. 2021) models, but does not provide specific version numbers for software libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | The query number N is 64. The loss weights λ_bbox, λ_giou, λ_c, λ_is, λ_a, and λ_clip are set to 2.5, 1, 1, 1, 1.6, and 700, respectively. For simplicity, the decoupling dynamic re-weighting in CDN is not used. For CLIP, we use the public pretrained model, with an input size of 224×224, and γ = 100. The cropped union regions are preprocessed by square padding and resizing. We feed prompt-engineered texts to the text encoder of CLIP with the prompt template "a picture of person {verb} {object}". Experiments are conducted on 4 Tesla V100 GPUs, with a batch size of 16. |
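
The reported setup feeds the prompt template "a picture of person {verb} {object}" to CLIP's text encoder and scores cropped human-object union regions against the resulting embeddings with temperature γ = 100. The snippet below is a minimal sketch of that step only, assuming the public openai/CLIP package and a placeholder verb-object list; it is not the authors' EoID implementation, whose class lists come from HICO-DET.

```python
# Sketch (not the authors' code): build CLIP text embeddings for HOI classes
# using the prompt template reported in the paper. Assumes the openai/CLIP
# package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# Public pretrained CLIP model with 224x224 input, as stated in the setup.
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical subset of verb-object pairs, for illustration only.
hoi_pairs = [("ride", "bicycle"), ("hold", "umbrella"), ("feed", "horse")]
prompts = [f"a picture of person {verb} {obj}" for verb, obj in hoi_pairs]

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(tokens)            # (num_hoi, embed_dim)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# At inference, a cropped union region (square-padded and resized to 224x224
# via `preprocess`) would be encoded by the image encoder and scored against
# these text embeddings with the temperature gamma reported above, e.g.:
# image_features = model.encode_image(preprocess(union_crop).unsqueeze(0).to(device))
# logits = 100.0 * image_features @ text_features.T      # gamma = 100
```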