Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Robust Cross-modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding

Authors: Yanglin Feng, Hongyuan Zhu, Dezhong Peng, Xi Peng, Xiaomin Song, Peng Hu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across four multimodal datasets demonstrate that CoRe dramatically reduces computational overhead while showing superiority in both scene retrieval and object grounding.
Researcher Affiliation | Collaboration | Yanglin Feng¹, Hongyuan Zhu², Dezhong Peng¹,³, Xi Peng¹, Xiaomin Song⁴, Peng Hu¹. ¹College of Computer Science, Sichuan University, Chengdu, China. ²Institute for Infocomm Research (I2R), A*STAR, Singapore. ³Tianfu Jincheng Laboratory, Chengdu, China. ⁴Sichuan National Innovation New Vision UHD Video Technology Co., Ltd., Chengdu, China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies using mathematical equations and textual explanations, along with architectural diagrams.
Open Source Code | Yes | Code is available at https://github.com/Yangl1nFeng/CoRe.
Open Datasets | Yes | Query descriptions of objects with cross-scene discrimination are generated through our spatial analysis texts of scenes and corresponding corpora from existing text datasets (i.e., ScanRefer [2], Nr3D [50], Sr3D [50], and ScanQA [51]). The 3D point-cloud data and object annotations are sourced from the widely used ScanNet dataset [49].
Dataset Splits | No | The paper mentions evaluating on different datasets and subsets (e.g., Conspicuous, Regular, Confusing for CrossScene-RETR), but does not explicitly provide specific training, validation, or test dataset splits (percentages or sample counts) in the main text. It states that implementation and training details can be found in the supplementary material.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running the experiments in the main text. While the NeurIPS checklist claims compute resources are reported in experiment settings, these details are not found in the provided paper content.
Software Dependencies | No | The paper mentions the use of pre-trained models like BERT [40], Mask3D [41], and PointNet++ [42], but does not provide specific version numbers for these or other ancillary software components (e.g., programming languages, libraries, or frameworks) in the main text.
Experiment Setup | No | The paper describes the loss function with trade-off parameters (λm, λg) and a temperature parameter (τ), and discusses the variable 'q' in the loss function, but it does not provide numerical values for these hyperparameters or for other training settings (e.g., learning rate, batch size, number of epochs) in the main text. It refers to supplementary material for implementation details.