EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection

Authors: Qinqian Lei, Bo Wang, Robby Tan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters compared to existing methods.
Researcher Affiliation | Collaboration | Qinqian Lei (1), Bo Wang (2), Robby T. Tan (1,3); (1) National University of Singapore; (2) University of Mississippi; (3) ASUS Intelligent Cloud Services (AICS). Emails: qinqian.lei@u.nus.edu, hawk.rsrch@gmail.com, robby_tan@asus.com
Pseudocode | No | The paper describes methods using mathematical equations and block diagrams, but it does not include formal pseudocode or algorithm blocks labeled as such.
Open Source Code | Yes | Code is available at https://github.com/ChelsieLei/EZ-HOI.
Open Datasets | Yes | We evaluate our method on HICO-DET by following the established protocol of zero-shot two-stage HOI detection methods [14, 3, 27]. Our object detector utilizes a pre-trained DETR model [5] with a ResNet50 backbone [12]. As for our learnable prompt design, we set p = 2, N = 9. The LLaVA-v1.5-7b model [37] is used to provide text descriptions, as explained in Sections 3.1 and 3.2. For all experiments, our batch size is set as 16 on 4 Nvidia A5000 GPUs. We use AdamW [39] as the optimizer and the initial learning rate is 1e-3. For more implementation details, please refer to Appendix Section 7.1.
Dataset Splits | No | HICO-DET comprises a total of 47,776 images, divided into 38,118 training images and 9,658 test images. This dataset features 600 Human-Object Interaction (HOI) classes, which are combinations derived from 117 action categories and 80 object categories. Our model's performance was evaluated in four distinct zero-shot HOI detection settings, categorized by the criterion for selecting the unseen HOI classes: rare-first unseen composition (RF-UC), non-rare-first unseen composition (NF-UC), unseen object (UO), and unseen verb (UV).
Hardware Specification | Yes | For all experiments, our batch size is set as 16 on 4 Nvidia A5000 GPUs.
Software Dependencies | No | The paper mentions software components such as the DETR model, ResNet50 backbone, LLaVA-v1.5-7b model, the AdamW optimizer, and the CLIP model, but it does not specify exact version numbers for these dependencies (e.g., PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | For all experiments, our batch size is set as 16 on 4 Nvidia A5000 GPUs. We use AdamW [39] as the optimizer and the initial learning rate is 1e-3. For more implementation details, please refer to Appendix Section 7.1. ... For the pre-trained CLIP model with the ViT-B visual encoder, the visual feature dimension dv = 768, while the text feature dimension dt = 512 and the final feature dimension for aligned visual and text features da = 512. For the CLIP model with the ViT-L visual encoder, dv = 1024, dt = 768, da = 768. We use an off-the-shelf object detector and add a threshold θ to filter out low-confidence predictions, and we set θ = 0.2. ... As for our learnable prompt design, we set p = 2, N = 9. ... In Eq. 17, α = 150, and in Eq. 18, τ = 1 during training and τ = 2.8 during inference.
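
The hyperparameters quoted in the Open Datasets and Experiment Setup rows can be gathered into one training configuration. The sketch below is illustrative only: the dictionary keys and the build_optimizer helper are hypothetical and not taken from the authors' released code; only the numeric values come from the paper.

```python
import torch

# Values quoted in the paper; key names are hypothetical, not from the EZ-HOI repo.
EZ_HOI_CONFIG = {
    "batch_size": 16,                # batch size on 4x Nvidia A5000 GPUs
    "optimizer": "AdamW",
    "learning_rate": 1e-3,           # initial learning rate
    "prompt_p": 2,                   # p = 2 in the learnable prompt design
    "prompt_N": 9,                   # N = 9 in the learnable prompt design
    "detector_conf_threshold": 0.2,  # θ: filter low-confidence detector outputs
    "alpha": 150,                    # α in Eq. 17
    "tau_train": 1.0,                # τ in Eq. 18 during training
    "tau_inference": 2.8,            # τ in Eq. 18 during inference
}

def build_optimizer(params):
    """Hypothetical helper: AdamW with the paper's initial learning rate."""
    return torch.optim.AdamW(params, lr=EZ_HOI_CONFIG["learning_rate"])
```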
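
The Dataset Splits row gives the HICO-DET statistics and the four zero-shot settings in prose; the summary below restates them as a minimal, hypothetical data structure (not the authors' data loader) so the split sizes can be sanity-checked.

```python
# HICO-DET statistics quoted in the Dataset Splits row.
HICO_DET = {
    "total_images": 47_776,
    "train_images": 38_118,
    "test_images": 9_658,
    "hoi_classes": 600,       # compositions drawn from 117 verbs and 80 objects
    "verb_classes": 117,
    "object_classes": 80,
}

# Four zero-shot settings, named by how the unseen HOI classes are selected.
ZERO_SHOT_SETTINGS = ["RF-UC", "NF-UC", "UO", "UV"]

# The train/test split covers the full dataset.
assert HICO_DET["train_images"] + HICO_DET["test_images"] == HICO_DET["total_images"]
```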
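
The Experiment Setup row also lists the feature dimensions dv, dt, and da for the two CLIP backbones. The module below is a minimal sketch, assuming simple linear projections into the shared alignment space of dimension da; the class and its layer names are hypothetical and the authors' actual adapter may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dimensions quoted in the Experiment Setup row.
CLIP_DIMS = {
    "ViT-B": {"d_v": 768, "d_t": 512, "d_a": 512},
    "ViT-L": {"d_v": 1024, "d_t": 768, "d_a": 768},
}

class FeatureAligner(nn.Module):
    """Hypothetical sketch: project visual and text features to dimension d_a
    and compare them by cosine similarity."""

    def __init__(self, backbone: str = "ViT-B"):
        super().__init__()
        dims = CLIP_DIMS[backbone]
        self.visual_proj = nn.Linear(dims["d_v"], dims["d_a"], bias=False)
        self.text_proj = nn.Linear(dims["d_t"], dims["d_a"], bias=False)

    def forward(self, visual_feat: torch.Tensor, text_feat: torch.Tensor):
        v = F.normalize(self.visual_proj(visual_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v @ t.T  # similarity logits between region features and HOI prompts
```

Scaling these similarities with the temperature τ (1 during training, 2.8 at inference, per the row above) would mirror the role of Eq. 18, though the exact form of that equation is not reproduced here.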