EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
Authors: Qinqian Lei, Bo Wang, Robby Tan
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters compared to existing methods. |
| Researcher Affiliation | Collaboration | Qinqian Lei (1), Bo Wang (2), Robby T. Tan (1,3); (1) National University of Singapore, (2) University of Mississippi, (3) ASUS Intelligent Cloud Services (AICS); qinqian.lei@u.nus.edu, hawk.rsrch@gmail.com, robby_tan@asus.com |
| Pseudocode | No | The paper describes methods using mathematical equations and block diagrams, but it does not include formal pseudocode or algorithm blocks labeled as such. |
| Open Source Code | Yes | Code is available at https://github.com/ChelsieLei/EZ-HOI. |
| Open Datasets | Yes | We evaluate our method on HICO-DET by following the established protocol of zero-shot two-stage HOI detection methods [14, 3, 27]. Our object detector utilizes a pre-trained DETR model [5] with a ResNet50 backbone [12]. As for our learnable prompts design, we set p = 2, N = 9. The LLaVA-v1.5-7b model [37] is used to provide text descriptions, as explained in Sections 3.1 and 3.2. For all experiments, our batch size is set as 16 on 4 Nvidia A5000 GPUs. We use AdamW [39] as the optimizer and the initial learning rate is 1e-3. For more implementation details, please refer to Appendix Section 7.1. |
| Dataset Splits | No | HICO-DET comprises a total of 47,776 images, divided into 38,118 training images and 9,658 test images. This dataset features 600 Human-Object Interaction (HOI) classes, which are combinations derived from 117 action categories and 80 object categories. Our model's performance was evaluated in four distinct zero-shot HOI detection settings, categorized by the criterion for selecting the unseen HOI classes: rare-first unseen composition (RF-UC), non-rare-first unseen composition (NF-UC), unseen object (UO), and unseen verb (UV). These statistics are collected in the dataset sketch after the table. |
| Hardware Specification | Yes | For all experiments, our batch size is set as 16 on 4 Nvidia A5000 GPUs. |
| Software Dependencies | No | The paper mentions software components like "DETR model", "ResNet50 backbone", "LLaVA-v1.5-7b model", "AdamW" as the optimizer, and "CLIP model", but it does not specify exact version numbers for these software dependencies (e.g., PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes | For all experiments, our batch size is set as 16 on 4 Nvidia A5000 GPUs. We use AdamW [39] as the optimizer and the initial learning rate is 1e-3. For more implementation details, please refer to Appendix Section 7.1. ... For the pre-trained CLIP model with the ViT-B visual encoder, the visual feature dimension dv = 768, while the text feature dimension dt = 512 and the final feature dimension for aligned visual and text features da = 512. For the CLIP model with the ViT-L visual encoder, dv = 1024, dt = 768, da = 768. We use an off-the-shelf object detector and add a threshold θ to filter out low-confidence predictions; we set θ = 0.2. ... As for our learnable prompts design, we set p = 2, N = 9. ... In Eq. 17, α = 150, and in Eq. 18, τ = 1 during training and τ = 2.8 during inference. These hyperparameters are gathered in the configuration sketch below the table. |
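
The Experiment Setup row quotes hyperparameters scattered across the paper. The sketch below collects them into a single Python configuration; it is a minimal sketch only, assuming a PyTorch setup like the authors': `EZHOIConfig`, `build_optimizer`, and the stand-in `nn.Embedding` prompt module are illustrative names, not the authors' code (see the linked GitHub repository for the actual implementation). Only the AdamW choice and the numeric values come from the quoted text.

```python
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class EZHOIConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are illustrative)."""
    # CLIP backbone feature dimensions: ViT-B -> (768, 512, 512), ViT-L -> (1024, 768, 768)
    visual_dim: int = 768       # d_v
    text_dim: int = 512         # d_t
    aligned_dim: int = 512      # d_a
    # Learnable prompt design
    prompt_depth: int = 2       # p
    prompt_length: int = 9      # N
    # Detector post-processing
    det_score_threshold: float = 0.2  # theta, filters low-confidence detections
    # Scaling factors reported for Eqs. 17-18
    alpha: float = 150.0
    tau_train: float = 1.0
    tau_infer: float = 2.8
    # Optimization
    batch_size: int = 16        # total, spread over 4 Nvidia A5000 GPUs
    lr: float = 1e-3


def build_optimizer(model: nn.Module, cfg: EZHOIConfig) -> torch.optim.Optimizer:
    """AdamW over the trainable parameters only, as reported in the paper."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=cfg.lr)


if __name__ == "__main__":
    cfg = EZHOIConfig()
    # Stand-in module for the learnable prompts; the real model lives in the authors' repo.
    prompts = nn.Embedding(cfg.prompt_length, cfg.text_dim)
    optimizer = build_optimizer(prompts, cfg)
    print(cfg)
    print(optimizer)
```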
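
Similarly, the HICO-DET statistics and the four zero-shot settings quoted in the Dataset Splits row can be sanity-checked with a few constants. The names below (`TRAIN_IMAGES`, `ZERO_SHOT_SETTINGS`, and so on) are illustrative and not part of any official HICO-DET tooling; the train/test split itself ships with the dataset annotations.

```python
# HICO-DET statistics as quoted in the Dataset Splits row.
TRAIN_IMAGES = 38_118
TEST_IMAGES = 9_658
TOTAL_IMAGES = 47_776
NUM_VERBS = 117
NUM_OBJECTS = 80
NUM_HOI_CLASSES = 600  # annotated subset of the 117 x 80 = 9,360 possible verb-object pairs

# Zero-shot evaluation settings, keyed by how the unseen HOI classes are selected.
ZERO_SHOT_SETTINGS = {
    "RF-UC": "rare-first unseen composition",
    "NF-UC": "non-rare-first unseen composition",
    "UO": "unseen object",
    "UV": "unseen verb",
}

assert TRAIN_IMAGES + TEST_IMAGES == TOTAL_IMAGES
assert NUM_HOI_CLASSES <= NUM_VERBS * NUM_OBJECTS
```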