CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection
Authors: Yunyao Mao, Jiajun Deng, Wengang Zhou, Li Li, Yao Fang, Houqiang Li
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, experiments on prevalent benchmarks show that our CLIP4HOI outperforms previous approaches on both rare and unseen categories, and sets a series of state-of-the-art records under a variety of zero-shot settings. To verify the effectiveness, we conducted extensive experiments on two prevalent HOI detection benchmarks, i.e., HICO-DET [4] and V-COCO [14]. Results show that our approach exhibits superior performance under a variety of zero-shot settings. |
| Researcher Affiliation | Collaboration | Yunyao Mao¹, Jiajun Deng², Wengang Zhou¹, Li Li¹, Yao Fang³, Houqiang Li¹. ¹CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China; ²The University of Adelaide, AIML; ³Merchants Union Consumer Finance Company Limited |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | No | No explicit statement or link to open-source code was found. |
| Open Datasets | Yes | To verify the effectiveness, we conducted extensive experiments on two prevalent HOI detection benchmarks, i.e., HICO-DET [4] and V-COCO [14]. |
| Dataset Splits | No | The paper defines seen and unseen categories for zero-shot settings but does not explicitly provide numeric percentages or sample counts for training, validation, and test data splits. |
| Hardware Specification | No | This work was supported by NSFC under Contract U20A20183 and 62021001. It was also supported by GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC, and the Supercomputing Center of the USTC. The text mentions a "GPU cluster" but no specific hardware models or detailed specifications. |
| Software Dependencies | No | The paper mentions using pre-trained DETR, ResNet-50, and CLIP models, but does not provide specific version numbers for software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow) or CUDA. |
| Experiment Setup | Yes | The HOI decoder has $N_l = 6$ layers. In each layer, the embedding dimension is 768, the head number of the multi-head attention is 12, and the hidden dimension of the feed-forward network is 3072. The prompt lengths for [PREFIX] and [CONJUN] are 8 and 2, respectively. Following [51], the hyper-parameter $\lambda$ is set to 1 during training and 2.8 during inference. (Section 5.1) The final training loss is formulated as the combination of global loss and pairwise loss: $\mathcal{L}_{\text{final}} = \text{FocalBCE}(S_{\text{glob}}, Y_{\text{glob}}) + \beta \, \text{FocalBCE}(S_{\text{pair}}, Y_{\text{pair}})$, where $Y_{\text{glob}} \in \{0, 1\}^{1 \times N_c}$ and $Y_{\text{pair}} \in \{0, 1\}^{N_{\text{pair}} \times N_c}$ are global and pairwise labels, respectively. $\beta$ is a hyper-parameter that adjusts the weight of the pairwise loss. (Section 4.5) |
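
The setup details quoted in the last row can be made concrete with a short sketch. The PyTorch snippet below is not the authors' released code: it assumes common focal-loss defaults (`alpha`, `gamma`), an off-the-shelf `nn.TransformerDecoder`, and an illustrative `beta`, and only mirrors the quantities named above ($N_l = 6$ layers, dimension 768, 12 heads, FFN 3072, and the weighted sum of global and pairwise focal BCE terms).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# HOI decoder configuration quoted in the table: N_l = 6 layers,
# embedding dimension 768, 12 attention heads, FFN hidden dimension 3072.
decoder_layer = nn.TransformerDecoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, batch_first=True
)
hoi_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)


def focal_bce(logits: torch.Tensor, targets: torch.Tensor,
              alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal binary cross-entropy over per-category logits.

    alpha and gamma are standard focal-loss defaults, not values
    reported in the quoted excerpts (assumption).
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    prob = logits.sigmoid()
    p_t = prob * targets + (1.0 - prob) * (1.0 - targets)
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()


def final_loss(s_glob: torch.Tensor, y_glob: torch.Tensor,
               s_pair: torch.Tensor, y_pair: torch.Tensor,
               beta: float = 1.0) -> torch.Tensor:
    """L_final = FocalBCE(S_glob, Y_glob) + beta * FocalBCE(S_pair, Y_pair).

    s_glob / y_glob: (1, N_c) global HOI scores and labels.
    s_pair / y_pair: (N_pair, N_c) pairwise scores and labels.
    beta weights the pairwise term; the paper treats it as a
    hyper-parameter, so the default of 1.0 here is illustrative.
    """
    return focal_bce(s_glob, y_glob) + beta * focal_bce(s_pair, y_pair)
```

As a usage example, with the 600 HOI categories of HICO-DET and, say, 16 candidate human-object pairs, `final_loss(torch.randn(1, 600), torch.zeros(1, 600), torch.randn(16, 600), torch.zeros(16, 600))` returns a scalar training loss.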