Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Authors: Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, Na Zhao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments. This section presents a comprehensive experimental evaluation of our proposed fine-grained embodied reasoning framework, validating its effectiveness in joint affordance grounding and motion estimation. In addition, we conduct an in-depth analysis to assess how key module optimizations impact the system's overall accuracy and robustness. |
| Researcher Affiliation | Academia | Xinyi Wang1, Xun Yang1, Yanlong Xu1, Yuchen Wu2, Zhen Li3, Na Zhao2; 1 University of Science and Technology of China, 2 Singapore University of Technology and Design, 3 Chinese University of Hong Kong, Shenzhen |
| Pseudocode | No | The paper describes the Chain-of-thought Reasoning pipeline in Section 3.3 with steps: Active View Selection, Affordance Grounding, and Motion Estimation, all presented in narrative form without a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/hannahwxy/AffordBot. |
| Open Datasets | Yes | Dataset. We conduct experiments on SceneFun3D [34], currently the only dataset that provides comprehensive annotations for fine-grained affordance grounding and motion estimation in 3D indoor scenes. |
| Dataset Splits | Yes | It comprises a total of 230 richly annotated scenes, including 200 scenes for training, 30 for validation. |
| Hardware Specification | Yes | For visual-language reasoning, we employ Qwen2.5-VL-72B [15] locally deployed on four NVIDIA A800 GPUs. ... We train for 1,000 epochs on an NVIDIA A800 with the learning rate of 0.0001, a batch size of 2, and 2cm voxelization to preserve spatial detail. |
| Software Dependencies | No | The paper mentions using Qwen2.5-VL-72B and fine-tuning Mask3D, but does not provide specific version numbers for general software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | We train for 1,000 epochs on an NVIDIA A800 with the learning rate of 0.0001, a batch size of 2, and 2cm voxelization to preserve spatial detail. |
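
The training settings quoted in the table can be gathered into a single configuration object, which may help anyone attempting a rerun. The following is a minimal sketch, not the authors' code: the class name, field names, and the `voxel_index` helper are hypothetical, and the floor-based quantization is a generic way to apply a 2 cm voxel grid that may differ from what the released implementation does.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedTrainingSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    epochs: int = 1000            # "We train for 1,000 epochs"
    learning_rate: float = 1e-4   # "learning rate of 0.0001"
    batch_size: int = 2           # "a batch size of 2"
    voxel_size_m: float = 0.02    # "2cm voxelization", expressed in metres
    train_scenes: int = 200       # "200 scenes for training"
    val_scenes: int = 30          # "30 for validation"


def voxel_index(point, voxel_size):
    """Map an (x, y, z) point in metres to integer voxel coordinates.

    Assumption: coordinates are quantized by flooring, a common convention
    for sparse voxel grids; the paper does not state the exact rule.
    """
    return tuple(math.floor(c / voxel_size) for c in point)


setup = ReportedTrainingSetup()
example_voxel = voxel_index((0.05, 0.0, -0.01), setup.voxel_size_m)
```

Here `example_voxel` holds the grid cell containing a point 5 cm along x and 1 cm below the origin along z. Keeping the reported values in one frozen object makes it easy to compare them against whatever defaults the released repository ships with.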