Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model
Authors: Zhenhao Zhang, Ye Shi, Lingxiao Yang, Suting Ni, Qi Ye, Jingya Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate state-of-the-art performance, outperforming existing methods with a large margin. Notably, OpenHOI generalizes robustly to unseen objects and open-vocabulary instructions, achieving strong compositional generalization across diverse scenarios. 4 Experiments: Our framework integrates two core stages: 1) fine-tuning a 3D MLLM to predict object affordance maps and decomposing open-vocabulary instructions into concrete sub-tasks, and 2) synthesizing HOI sequences with the condition output of stage 1. We evaluate our method using diverse datasets and metrics, demonstrating its capability to generate long-horizon HOI sequences with high-level instructions of both seen and unseen objects. Comparative experiments and ablations validate our design choices. Our experiments were conducted on NVIDIA A100 GPU. |
| Researcher Affiliation | Academia | Zhenhao Zhang1, Ye Shi1, Lingxiao Yang1, Suting Ni1, Qi Ye2, Jingya Wang1; 1ShanghaiTech University, 2Zhejiang University; EMAIL EMAIL |
| Pseudocode | No | The paper describes methods and processes using mathematical formulas and descriptive text in sections like "3.1 Instruction Decomposition and Affordance Reasoning via 3D MLLM" and "B.2 Instruction Decomposition". However, it does not include a clearly labeled pseudocode block or algorithm figure. |
| Open Source Code | No | D Code and Dataset: We will release our code as soon as possible. GitHub is OpenHOI |
| Open Datasets | Yes | For our experiments, we utilize two prominent hand-object interaction datasets: GRAB [38], which provides comprehensive full-body motion data of subjects interacting with 51 everyday objects, and ARCTIC [6], a large-scale dataset specializing in bi-manual interactions with articulated objects and dense 3D annotations. [38] Omid Taheri et al. GRAB: A Dataset of Whole-Body Human Grasping of Objects. In: European Conference on Computer Vision (ECCV). 2020. URL: https://grab.is.tue.mpg.de. [6] Zicong Fan et al. ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2023. |
| Dataset Splits | Yes | For both GRAB [38] and ARCTIC [6], we follow a standard protocol by partitioning each dataset into 80% for training and 20% for unseen testing, ensuring reliable evaluation of our model's generalization capabilities. |
| Hardware Specification | Yes | Our experiments were conducted on NVIDIA A100 GPU. |
| Software Dependencies | No | Appendix A: 3D MLLM. We initialize our model from the ShapeLLM-7B checkpoint, freezing its 3D encoder and augmenting the visual backbone with Uni3D for robust dense 3D prediction, while the projection head is implemented as a shallow MLP, and LoRA is applied to streamline fine-tuning. Training unfolds in two stages: first, we optimize for seven epochs with AdamW (learning rate 2 × 10⁻⁴, zero weight decay) under a cosine-annealing schedule and a 2% linear warm-up; then, we continue for three additional epochs with AdamW (learning rate 5 × 10⁻⁴, zero weight decay) under the same cosine schedule but a 1% warm-up. Diffusion Model. We employ a T = 1000-step noising process with a cosine noise schedule, and inject positional information at both the frame- and agent-levels using sinusoidal encodings. During sampling, we apply classifier-free guidance by randomly substituting 10% of conditioning inputs with unconditional noise while retaining 90% of the original conditions, and use a guidance scale of 2.5 to steer the denoising trajectory. |
| Experiment Setup | Yes | Appendix A: 3D MLLM. ... Training unfolds in two stages: first, we optimize for seven epochs with AdamW (learning rate 2 × 10⁻⁴, zero weight decay) under a cosine-annealing schedule and a 2% linear warm-up; then, we continue for three additional epochs with AdamW (learning rate 5 × 10⁻⁴, zero weight decay) under the same cosine schedule but a 1% warm-up. Diffusion Model. We employ a T = 1000-step noising process with a cosine noise schedule, and inject positional information at both the frame- and agent-levels using sinusoidal encodings. During sampling, we apply classifier-free guidance by randomly substituting 10% of conditioning inputs with unconditional noise while retaining 90% of the original conditions, and use a guidance scale of 2.5 to steer the denoising trajectory. |
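
The Experiment Setup excerpt describes a cosine-annealing learning-rate schedule with a short linear warm-up. The sketch below shows one common way to compute such a schedule. It is an illustration only and not the authors' code: the function name `lr_at_step`, the decay to zero, the shape of the warm-up ramp, and the step count of 1000 are all assumptions. The peak rate 2e-4 reflects a reading of the garbled source value "2 10 4".

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac):
    """Learning rate at a 0-indexed step: linear warm-up, then cosine decay.

    Assumptions (not stated in the excerpt): the warm-up ramps linearly up
    to peak_lr, and the cosine phase anneals all the way to zero.
    """
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Stage 1 as read from the excerpt: peak rate 2e-4 with a 2% warm-up.
# The total of 1000 steps is a hypothetical value chosen for illustration.
stage1 = [lr_at_step(s, 1000, 2e-4, 0.02) for s in range(1000)]
```

Under the same assumptions, stage 2 would reuse the function with its own peak rate and `warmup_frac=0.01`.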