Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Authors: Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, Na Zhao

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | 4 Experiments. This section presents a comprehensive experimental evaluation of our proposed fine-grained embodied reasoning framework, validating its effectiveness in joint affordance grounding and motion estimation. In addition, we conduct an in-depth analysis to assess how key module optimizations impact the system's overall accuracy and robustness. |
| Researcher Affiliation | Academia | Xinyi Wang1, Xun Yang1, Yanlong Xu1, Yuchen Wu2, Zhen Li3, Na Zhao2; 1 University of Science and Technology of China; 2 Singapore University of Technology and Design; 3 Chinese University of Hong Kong, Shenzhen |
| Pseudocode | No | The paper describes the Chain-of-thought Reasoning pipeline in Section 3.3 with steps: Active View Selection, Affordance Grounding, and Motion Estimation, all presented in narrative form without a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/hannahwxy/AffordBot. |
| Open Datasets | Yes | Dataset. We conduct experiments on SceneFun3D [34], currently the only dataset that provides comprehensive annotations for fine-grained affordance grounding and motion estimation in 3D indoor scenes. |
| Dataset Splits | Yes | It comprises a total of 230 richly annotated scenes, including 200 scenes for training, 30 for validation. |
| Hardware Specification | Yes | For visual-language reasoning, we employ Qwen2.5-VL-72B [15] locally deployed on four NVIDIA A800 GPUs. ... We train for 1,000 epochs on an NVIDIA A800 with the learning rate of 0.0001, a batch size of 2, and 2cm voxelization to preserve spatial detail. |
| Software Dependencies | No | The paper mentions using Qwen2.5-VL-72B and fine-tuning Mask3D, but does not provide specific version numbers for general software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | We train for 1,000 epochs on an NVIDIA A800 with the learning rate of 0.0001, a batch size of 2, and 2cm voxelization to preserve spatial detail. |
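
For readers who want to record the reported experiment setup in machine-readable form, the following is a minimal sketch, not the authors' code. The numeric values come from the quoted setup (1,000 epochs, learning rate 0.0001, batch size 2, 2 cm voxels). The class name `TrainConfig`, its field names, and the helper `voxel_index` are hypothetical, and the floor-based quantization is only one common way to voxelize points; the paper's actual implementation may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted in the report."""
    epochs: int = 1000            # "1,000 epochs"
    learning_rate: float = 1e-4   # "learning rate of 0.0001"
    batch_size: int = 2           # "a batch size of 2"
    voxel_size_m: float = 0.02    # "2cm voxelization", expressed in metres


def voxel_index(point, voxel_size):
    """Map a 3D point (in metres) to integer voxel coordinates.

    Assumption: simple floor quantization on a grid anchored at the origin.
    """
    return tuple(math.floor(c / voxel_size) for c in point)


cfg = TrainConfig()
# A point 5 cm along x and 1 cm below the origin along z.
example = voxel_index((0.05, 0.0, -0.01), cfg.voxel_size_m)
```

With a 2 cm grid, points closer together than one voxel edge generally collapse into the same cell, which is the sense in which a small voxel size "preserves spatial detail".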