Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Authors: Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement.
Researcher Affiliation Academia Chanhyeong Yang¹, Taehoon Song², Jihwan Park¹, Hyunwoo J. Kim²; ¹Korea University, ²Korea Advanced Institute of Science and Technology; EMAIL EMAIL
Pseudocode No The paper describes the methodology in prose with mathematical equations and architectural diagrams (Figures 2 and 3), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code is available at https://github.com/mlvlab/VDRP.
Open Datasets Yes We conduct experiments on the HICO-DET benchmark for HOI detection. HICO-DET contains 80 object categories from the COCO dataset [51] and 117 actions, forming 600 HOI classes.
Dataset Splits Yes HICO-DET contains 80 object categories from the COCO dataset [51] and 117 actions, forming 600 HOI classes. It includes 47,776 images, with 38,118 for training and 9,658 for testing. Zero-shot setting on HICO-DET. Following prior works [3, 2, 1, 23], we evaluate under four settings: Non-rare First Unseen Composition (NF-UC), Rare First (RF-UC), Unseen Object (UO), and Unseen Verb (UV). NF-UC and RF-UC define 120 unseen and 480 seen HOI triplets from 600 total, with unseen compositions drawn from head and tail categories, respectively. UO uses 68 object classes to construct 500 seen and 100 unseen triplets. UV withholds 20 out of 117 verb classes, yielding 516 seen and 84 unseen triplets.
Hardware Specification Yes For the base model using CLIP ViT-B/16, we train on two NVIDIA GeForce RTX 3090 GPUs with a batch size of 8 for 12 epochs. [...] Due to memory constraints, these experiments are performed on two NVIDIA RTX 6000 Ada Generation GPUs with a reduced batch size of 4 per GPU.
Software Dependencies No The paper mentions using 'PyTorch' for experiments but does not provide specific version numbers for PyTorch or other key software libraries and dependencies.
Experiment Setup Yes For the base model using CLIP ViT-B/16, we train on two NVIDIA GeForce RTX 3090 GPUs with a batch size of 8 for 12 epochs. We use the AdamW [53] optimizer with an initial learning rate of 1 × 10⁻³, decayed to 1 × 10⁻⁴ using a cosine scheduler and a weight decay of 8. [...] The context embedding consists of N_ctx = 24 learnable tokens, each initialized from a Gaussian distribution with standard deviation 0.02. To incorporate group-wise variance, we inject a modulation vector (scaled by α = 0.02) into the context embedding and apply Gaussian perturbation to the resulting prompt embedding with a noise scale β = 0.1. For region-aware prompt augmentation, we generate K = 10 region concepts per verb and region type (human, object, union) [...] and aggregated via Sparsemax [50] to form a concept vector, which is added to the prompt embedding with scaling factor γ = 0.2.
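
Two pieces of the quoted setup can be illustrated with a minimal, dependency-free sketch: a cosine learning-rate decay and the Sparsemax aggregation weights. This is not the authors' code (see the linked repository for that). It assumes the learning rate goes from 1e-3 to 1e-4 over the 12 epochs with per-epoch stepping and no warmup, which the excerpt does not state. The names `cosine_lr` and `sparsemax` are hypothetical helpers introduced here for illustration.

```python
import math

# Values read from the Experiment Setup row (base CLIP ViT-B/16 model).
LR_INIT = 1e-3   # initial learning rate
LR_FINAL = 1e-4  # learning rate reached at the end of the cosine decay
EPOCHS = 12      # number of training epochs


def cosine_lr(epoch, total=EPOCHS, lr_init=LR_INIT, lr_final=LR_FINAL):
    """Cosine-annealed learning rate at the start of a 0-indexed epoch.

    Assumption: per-epoch stepping without warmup; the excerpt does not say.
    """
    progress = min(max(epoch / total, 0.0), 1.0)  # clamp to [0, 1]
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))


def sparsemax(scores):
    """Sparsemax: Euclidean projection of `scores` onto the probability simplex.

    Unlike softmax, it can assign exactly zero weight to low-scoring entries.
    """
    ordered = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(ordered, start=1):
        cumsum += z_k
        # The support condition holds for a prefix of the sorted scores,
        # so the last k that satisfies it determines the threshold tau.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


if __name__ == "__main__":
    for epoch in (0, 6, 12):
        print(epoch, cosine_lr(epoch))
    # Hypothetical similarity scores for a handful of region concepts.
    print(sparsemax([2.0, 1.5, 0.1, -1.0]))
```

In a PyTorch training loop, the same schedule would typically come from a built-in scheduler rather than a hand-written function; the sketch only makes the arithmetic explicit.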