Toward Open-Set Human Object Interaction Detection
Authors: Mingrui Wu, Yuqi Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes. |
| Researcher Affiliation | Academia | (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; (2) Institute of Artificial Intelligence, Xiamen University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as 'Algorithm' or 'Pseudocode'. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. A link to 'https://github.com/openai/CLIP' is provided, but this is for a third-party tool (CLIP) used by the authors, not their own implementation code. |
| Open Datasets | Yes | We perform our experiments on HICO-DET (Chao et al. 2018) and Visual Genome (VG) (Krishna et al. 2017). ... The HICO-DET dataset consists of 37,536 training images and 9,658 test images. ... VG dataset contains 108,077 images... We extract a subset from VG to form a VG-HOI dataset with 43,118 images... |
| Dataset Splits | Yes | The HICO-DET dataset consists of 37,536 training images and 9,658 test images. ... On HICO-DET, we split some HOIs as unseen settings following previous works, including the Rare-first unseen combination scenario (RF-UC) (Hou et al. 2020), Non-rare-first UC (NF-UC) (Hou et al. 2020), the unseen action scenario (UA) (Liu, Yuan, and Chen 2020), and the unseen object scenario (UO) (Bansal et al. 2020). (A rough illustration of such a split follows the table.) |
| Hardware Specification | Yes | The interaction head is trained for 20 epochs in about 7 hours on 2 NVIDIA GTX3090 GPUs, with a batch size of 4 per GPU. |
| Software Dependencies | No | The paper mentions software components like 'Grounding DINO' and 'CLIP' with their backbones (Swin-B, VIT-B/32), but it does not provide specific version numbers for these or any other ancillary software packages required for replication. |
| Experiment Setup | Yes | In the interaction head, the number of unary adapter layers is 2 and the number of pair-wise adapter layers is 1. For the open-set object detector, we use the Grounding DINO (Liu et al. 2023) with Swin-B backbone... For VLM, we use the public pre-trained model CLIP with VIT-B/32 backbone... The interaction head is trained for 20 epochs in about 7 hours on 2 NVIDIA GTX3090 GPUs, with a batch size of 4 per GPU. (A minimal sketch of the CLIP portion of this setup follows the table.) |
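
Since the authors' implementation is not publicly released (see the Open Source Code row above), the exact training and inference interfaces cannot be quoted here. The sketch below only illustrates the CLIP side of the quoted setup, assuming the public openai/CLIP package with the ViT-B/32 backbone; the prompt template and HOI class phrases are hypothetical placeholders, not identifiers from the paper.

```python
# Minimal sketch: encoding HOI class prompts with the public CLIP ViT-B/32 model
# referenced in the Experiment Setup row. The prompt template and class list are
# illustrative assumptions, not the authors' released code.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical HOI classes written as "<action> <object>" phrases.
hoi_classes = ["ride bicycle", "hold umbrella", "feed horse"]
prompts = [f"a photo of a person {c}" for c in hoi_classes]

with torch.no_grad():
    text_tokens = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(text_tokens)            # (num_classes, 512)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# At inference, normalized region/image embeddings from the visual side can be
# scored against these text embeddings by cosine similarity, so adding an HOI
# class only requires adding a prompt string rather than retraining.
```

A text-side pathway of this kind is consistent with the paper's reported generalization from 600 training HOI classes to over 17k HOI classes without additional training.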
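
The unseen settings quoted in the Dataset Splits row all amount to holding out a subset of the 600 HICO-DET HOI categories during training. The official held-out lists come from the cited works (Hou et al. 2020; Liu, Yuan, and Chen 2020; Bansal et al. 2020) and are not reproduced in this report; the snippet below is only a rough illustration of a rare-first (RF-UC) style split, and the count of 120 unseen categories and the flat label format are assumptions made for the example.

```python
# Rough illustration of a rare-first unseen-combination (RF-UC) style split:
# sort HOI categories by training-set frequency and hold out the rarest ones
# as "unseen". The value num_unseen=120 and the label format are assumed for
# illustration; they are not the official split files.
from collections import Counter

def rare_first_split(train_hoi_labels, num_unseen=120):
    """train_hoi_labels: iterable of HOI category ids, one per training instance."""
    freq = Counter(train_hoi_labels)
    # Rarest categories first; ties broken by category id for determinism.
    ranked = sorted(freq, key=lambda c: (freq[c], c))
    unseen = set(ranked[:num_unseen])
    seen = set(freq) - unseen
    return seen, unseen

# Training then keeps only instances whose HOI label falls in `seen`, while
# evaluation reports mAP separately over the `unseen` categories.
```

The NF-UC variant mirrors this by holding out the most frequent combinations instead, while the UA and UO settings hold out entire action or object classes rather than specific action-object combinations.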