Toward Open-Set Human Object Interaction Detection
Authors: Mingrui Wu, Yuqi Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes. |
| Researcher Affiliation | Academia | (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; (2) Institute of Artificial Intelligence, Xiamen University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as 'Algorithm' or 'Pseudocode'. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. A link to 'https://github.com/openai/CLIP' is provided, but this is for a third-party tool (CLIP) used by the authors, not their own implementation code. |
| Open Datasets | Yes | We perform our experiments on HICO-DET (Chao et al. 2018) and Visual Genome (VG) (Krishna et al. 2017). ... The HICO-DET dataset consists of 37,536 training images and 9,658 test images. ... VG dataset contains 108,077 images... We extract a subset from VG to form a VG-HOI dataset with 43,118 images... |
| Dataset Splits | Yes | The HICO-DET dataset consists of 37,536 training images and 9,658 test images. ... On HICO-DET, we split some HOIs as unseen settings following previous works, including the Rare-first unseen combination scenario (RF-UC) (Hou et al. 2020), Non-rare-first UC (NF-UC) (Hou et al. 2020), the unseen action scenario (UA) (Liu, Yuan, and Chen 2020), and the unseen object scenario (UO) (Bansal et al. 2020). (A rough illustration of such a split follows the table.) |
| Hardware Specification | Yes | The interaction head is trained for 20 epochs in about 7 hours on 2 NVIDIA GTX3090 GPUs, with a batch size of 4 per GPU. |
| Software Dependencies | No | The paper mentions software components like 'Grounding DINO' and 'CLIP' with their backbones (Swin-B, VIT-B/32), but it does not provide specific version numbers for these or any other ancillary software packages required for replication. |
| Experiment Setup | Yes | In the interaction head, the number of unary adapter layers is 2 and the number of pair-wise adapter layers is 1. For the open-set object detector, we use the Grounding DINO (Liu et al. 2023) with Swin-B backbone... For VLM, we use the public pre-trained model CLIP with VIT-B/32 backbone... The interaction head is trained for 20 epochs in about 7 hours on 2 NVIDIA GTX3090 GPUs, with a batch size of 4 per GPU. (A minimal sketch of the CLIP portion of this setup follows the table.) |
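
Since the authors' implementation is not publicly released (see the Open Source Code row above), the exact training and inference interfaces cannot be quoted here. The sketch below only illustrates the CLIP side of the quoted setup, assuming the public openai/CLIP package with the ViT-B/32 backbone; the prompt template and HOI class phrases are hypothetical placeholders, not identifiers from the paper.

```python
# Minimal sketch: encoding HOI class prompts with the public CLIP ViT-B/32 model
# referenced in the Experiment Setup row. The prompt template and class list are
# illustrative assumptions, not the authors' released code.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical HOI classes written as "<action> <object>" phrases.
hoi_classes = ["ride bicycle", "hold umbrella", "feed horse"]
prompts = [f"a photo of a person {c}" for c in hoi_classes]

with torch.no_grad():
    text_tokens = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(text_tokens)            # (num_classes, 512)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# At inference, normalized region/image embeddings from the visual side can be
# scored against these text embeddings by cosine similarity, so adding an HOI
# class only requires adding a prompt string rather than retraining.
```

A text-side pathway of this kind is consistent with the paper's reported generalization from 600 training HOI classes to over 17k HOI classes without additional training.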
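
The unseen settings quoted in the Dataset Splits row all amount to holding out a subset of the 600 HICO-DET HOI categories during training. The official held-out lists come from the cited works (Hou et al. 2020; Liu, Yuan, and Chen 2020; Bansal et al. 2020) and are not reproduced in this report; the snippet below is only a rough illustration of a rare-first (RF-UC) style split, and the count of 120 unseen categories and the flat label format are assumptions made for the example.

```python
# Rough illustration of a rare-first unseen-combination (RF-UC) style split:
# sort HOI categories by training-set frequency and hold out the rarest ones
# as "unseen". The value num_unseen=120 and the label format are assumed for
# illustration; they are not the official split files.
from collections import Counter

def rare_first_split(train_hoi_labels, num_unseen=120):
    """train_hoi_labels: iterable of HOI category ids, one per training instance."""
    freq = Counter(train_hoi_labels)
    # Rarest categories first; ties broken by category id for determinism.
    ranked = sorted(freq, key=lambda c: (freq[c], c))
    unseen = set(ranked[:num_unseen])
    seen = set(freq) - unseen
    return seen, unseen

# Training then keeps only instances whose HOI label falls in `seen`, while
# evaluation reports mAP separately over the `unseen` categories.
```

The NF-UC variant mirrors this by holding out the most frequent combinations instead, while the UA and UO settings hold out entire action or object classes rather than specific action-object combinations.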