Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learning Human-Object Interaction as Groups

Authors: Jiajun Hong, Jianan Wei, Wenguan Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
Researcher Affiliation | Academia | Jiajun Hong, Jianan Wei, Wenguan Wang; Zhejiang University
Pseudocode | Yes | D Pseudo Code: The pseudo code for semantic and geometric group are given in Algorithm 1 and Algorithm 2.
Open Source Code | Yes | Zhejiang University; https://github.com/JiajunHong1/GroupHOI
Open Datasets | Yes | Our method is evaluated on two standard HOI-DET benchmarks: V-COCO [18] and HICO-DET [2]: V-COCO is a specialized subset of MS-COCO [73], which comprises 10,346 images (5,400 for training and 4,946 for testing). HICO-DET comprises a total of 47,776 images, with 38,118 designated for training and 9,658 for testing.
Dataset Splits | Yes | V-COCO is a specialized subset of MS-COCO [73], which comprises 10,346 images (5,400 for training and 4,946 for testing). HICO-DET comprises a total of 47,776 images, with 38,118 designated for training and 9,658 for testing. NVI [20] densely labeling social groups in pictures, along with 22 atomic-level nonverbal behaviors... It contains 13,711 images in total and splits them into 9,634, 1,418 and 2,659 for train, val and test.
Hardware Specification | Yes | The model is trained with a batchsize of 8 for 90 epochs on 2 GeForce RTX 4090 GPUs.
Software Dependencies | No | GroupHOI is implemented in PyTorch. The paper does not specify the version of PyTorch or any other software dependencies with version numbers.
Experiment Setup | Yes | Our transformer-based architecture consists of a 6-layer encoder, a 3-layer instance decoder, and a 3-layer interaction decoder. Following [3, 4], we initialize 64 learnable queries for human and object branches, and set the feature dimensions to 256 for human/object representations and 768 for interaction representations. We perform group construction independently at each layer of both the instance and interaction decoders, where the geometric and semantic group sizes are set to 4 and 2. $\mathcal{L}_{\text{HOI}} = \lambda_b \mathcal{L}_b + \lambda_u \mathcal{L}_u + \lambda_c^o \mathcal{L}_c^o + \lambda_c^a \mathcal{L}_c^a$, where... The coefficient factors $\{\lambda_b, \lambda_u, \lambda_c^o, \lambda_c^a\}$ are empirically set as {2.5, 1, 1, 1}. The model is trained with a batchsize of 8 for 90 epochs... The initial learning is set to 5e-5, which reduces by a factor of 10 every 30 epochs.
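The numbers quoted in the Experiment Setup and Dataset Splits rows can be collected into a small, dependency-free sketch. It is not the authors' code. The names `TrainConfig`, `lr_at_epoch`, `total_loss` and `splits_consistent` are hypothetical. The sketch assumes the initial learning rate is 5e-5, that epochs are counted from 0, and that the decay takes effect exactly at epochs 30 and 60. The four loss terms are passed in as plain numbers, because the page elides their definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row; field names are illustrative.
    epochs: int = 90
    batch_size: int = 8
    base_lr: float = 5e-5    # assumption: the quoted initial rate is 5e-5
    lr_step: int = 30        # decay interval in epochs
    lr_gamma: float = 0.1    # "reduces by a factor of 10"
    lambda_b: float = 2.5    # weight on L_b
    lambda_u: float = 1.0    # weight on L_u
    lambda_oc: float = 1.0   # weight on the "o c" term
    lambda_ac: float = 1.0   # weight on the "a c" term


def lr_at_epoch(cfg: TrainConfig, epoch: int) -> float:
    """Step decay, assuming 0-indexed epochs: base_lr * gamma ** (epoch // step)."""
    return cfg.base_lr * cfg.lr_gamma ** (epoch // cfg.lr_step)


def total_loss(cfg: TrainConfig, l_b: float, l_u: float,
               l_oc: float, l_ac: float) -> float:
    """Weighted sum of the four loss terms, following the quoted formula."""
    return (cfg.lambda_b * l_b + cfg.lambda_u * l_u
            + cfg.lambda_oc * l_oc + cfg.lambda_ac * l_ac)


# Image counts quoted in the Dataset Splits row: (total, parts).
SPLITS = {
    "V-COCO": (10_346, (5_400, 4_946)),
    "HICO-DET": (47_776, (38_118, 9_658)),
    "NVI": (13_711, (9_634, 1_418, 2_659)),
}


def splits_consistent(splits=SPLITS) -> bool:
    """Check that every reported total equals the sum of its splits."""
    return all(total == sum(parts) for total, parts in splits.values())
```

Under these assumptions, the schedule has three plateaus of 30 epochs each, and the reported totals for V-COCO, HICO-DET and NVI each equal the sum of their splits.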