Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ViSPLA: Visual Iterative Self-Prompting for Language-Guided 3D Affordance Learning

Authors: Hritam Basak, Zhaozheng Yin

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that ViSPLA achieves state-of-the-art results on both seen and unseen objects on two benchmark datasets. Our framework establishes a new paradigm for open-world 3D affordance reasoning by unifying language comprehension with low-level geometric perception through iterative refinement.
Researcher Affiliation | Academia | Hritam Basak, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA, EMAIL; Zhaozheng Yin, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA, EMAIL
Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Although the datasets used in the paper are open-sourced, we do not release the code for this work. However, sufficient details are provided in the paper to ensure reproducibility.
Open Datasets | Yes | Following previous works [5, 14], we conduct evaluations on two complementary 3D affordance datasets: PIAD [12] and LASO [14], each designed to test different aspects of generalization.
Dataset Splits | Yes | LASO, on the other hand, contains 19,751 language-guided point cloud pairs spanning 8,434 unique object instances across 23 object categories and 17 affordance types. It supports both Seen and Unseen splits, where the Unseen configuration deliberately excludes specific affordance-object combinations during training to assess zero-shot generalization.
Hardware Specification | Yes | All experiments are done on four NVIDIA V100 GPU with a batch size of 16, training for 20 epochs in 12hr.
Software Dependencies | No | We utilize Phi-3.5-mini-instruct [27] as our base LLM with LoRA [28] fine-tuning. For 3D processing, we adopt Point-BERT [29] pre-trained with ULIP2 [30] as our point encoder (f_PE) and Point Transformer [31] as our point backbone (f_PB). We use AdamW optimizer with an initial learning rate of 4 × 10^-5 with cosine scheduling and warm-up ratio of 0.03.
Experiment Setup | Yes | For our iterative self-prompting mechanism, we set the number of refinement iterations T = 3 (as performance plateaus beyond this point while computational cost rises sharply (see Figure 3)), with weight parameters λ_t = 0.8^t to gradually reduce consistency constraints. In the IDGSP loss, we set α = 0.1 for the Tikhonov regularization term. For INAFS, we use λ_1 = 0.5, λ_2 = 0.3, and β = 0.05. The SCSP module uses K = 3 frequency bands (following validation in Figure 3) with weights γ_1 = 1.0, γ_2 = 0.7, γ_3 = 0.4, and τ = 0.2 for the total variation term. We use AdamW optimizer with an initial learning rate of 4 × 10^-5 with cosine scheduling and warm-up ratio of 0.03. All experiments are done on four NVIDIA V100 GPU with a batch size of 16, training for 20 epochs in 12hr.
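
Since no code is released, the quoted setup can only be approximated. The following is a minimal sketch, not the authors' implementation, of how the reported per-iteration consistency weights (0.8 raised to the power t, for T = 3) and the learning-rate schedule (initial rate 4e-5, cosine scheduling, warm-up ratio 0.03) might be computed. The names `ReportedSetup`, `consistency_weights`, and `lr_at` are hypothetical. The sketch also assumes that t starts at 1, that warm-up is linear, that the cosine decays to zero, and that the total step count is supplied by the caller, none of which the quoted text states.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_iterations: int = 3      # T, number of refinement iterations
    lambda_base: float = 0.8     # base of the per-iteration consistency weight
    base_lr: float = 4e-5        # initial learning rate
    warmup_ratio: float = 0.03   # fraction of training steps used for warm-up
    batch_size: int = 16
    epochs: int = 20


def consistency_weights(cfg, first_index=1):
    """Return lambda_base ** t for T consecutive iterations.

    Assumption: iterations are indexed from `first_index` (1 by default);
    the quoted text does not say whether t starts at 0 or 1.
    """
    last = first_index + cfg.num_iterations
    return [cfg.lambda_base ** t for t in range(first_index, last)]


def lr_at(step, total_steps, cfg):
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear warm-up followed by cosine decay to zero. The quoted
    text names only "cosine scheduling" and a warm-up ratio.
    """
    warmup_steps = max(1, math.ceil(cfg.warmup_ratio * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return cfg.base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cfg.base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = ReportedSetup()
    print("consistency weights:", consistency_weights(cfg))
    total = 1000  # hypothetical step count; depends on dataset size
    for s in (0, 29, 500, 999):
        print(f"step {s}: lr = {lr_at(s, total, cfg):.3e}")
```

The weights shrink with each iteration, which matches the stated intent of gradually relaxing the consistency constraints. The real total step count depends on the dataset size, batch size of 16, and 20 epochs, so it is left as a parameter.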