Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Authors: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception.
Researcher Affiliation | Academia | Yujia Zhang (1), Xiaoyang Wu (1), Yixing Lao (1), Chengyao Wang (2), Zhuotao Tian (3), Naiyan Wang (1), Hengshuang Zhao (1); (1) The University of Hong Kong, (2) The Chinese University of Hong Kong, (3) Harbin Institute of Technology (Shenzhen)
Pseudocode | No | The paper includes architectural diagrams (Figure 3) and descriptive text for its methodology, but it does not present a formal, structured pseudocode block or algorithm.
Open Source Code | Yes | Concerto is based on the current welcome codebase Pointcept, which contains code and required datasets in it. And we provide with the code and data.
Open Datasets | Yes | We train Concerto on ScanNet [15], ScanNet++ [52], Structured3D [60], S3DIS [1], ArkitScenes [7], and HM3D [33] datasets, utilizing ScanNet, ScanNet200, ScanNet++, and S3DIS to evaluate the model by linear probing, decoder probing, and full fine-tuning and ScanNet Data Efficient [19] to evaluate the data efficiency. The pre-training setting is the default, described in Tab. 6. More specific pre-training details are available in the Appendix. We use the open-source datasets ScanNet [15], ScanNet++ [52], S3DIS [1], Structured3D [60], ARKitScenes [7], Habitat Matterport3D [33] and RealEstate10K [61] in latest versions.
Dataset Splits | Yes | For evaluation with linear probing, decoder probing, and full fine-tuning, we train on the train split and test on the val split of ScanNet, ScanNet++, ScanNet200, and Area 5 of S3DIS. We adopt the ScanNet Data Efficient [19] benchmark and compare the validation mIoU (%) results of Concerto with previous methods in three evaluation protocols.
Hardware Specification | Yes | GPU: Nvidia H20 × 16 for pretraining; Nvidia H20 × 8 for evaluation. CPU: 360 for pretraining; 180 for evaluation. Memory: 3600GB for pretraining; 1800GB for evaluation.
Software Dependencies | Yes | CUDA version: 12.4; PyTorch version: 2.4.1; Python version: 3.10.15
Experiment Setup | Yes | We use AdamW as the optimizer, and cosine annealing policy as the scheduler. The learning rate is adjusted with the encoder depth, and the max one is 0.004. The pretraining epoch is 100. For cross-modal joint embedding prediction, we set DINOv2 image encoder input resolution 518 × 518.
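
For readers who want a feel for the schedule quoted in the Experiment Setup row, the following is a minimal sketch of a standard cosine annealing curve using the two reported values (peak learning rate 0.004, 100 pretraining epochs). It is not the authors' implementation: the function name `cosine_annealing_lr`, the floor `min_lr = 0.0`, per-epoch stepping, and the absence of warmup are all assumptions, and the depth-dependent learning-rate adjustment mentioned in the quote is not modeled here.

```python
import math


def cosine_annealing_lr(
    epoch: int,
    total_epochs: int = 100,   # reported pretraining length
    max_lr: float = 0.004,     # reported maximum learning rate
    min_lr: float = 0.0,       # assumption: the floor is not stated in the quote
) -> float:
    """Return the learning rate at `epoch` under plain cosine annealing.

    Hypothetical helper for illustration only; it ignores warmup and the
    depth-dependent scaling mentioned in the paper excerpt.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    # Cosine factor falls smoothly from 1 at epoch 0 to 0 at the final epoch.
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (max_lr - min_lr) * cos_factor


# One value per epoch boundary, from the start of training to the end.
schedule = [cosine_annealing_lr(e) for e in range(101)]
```

Under these assumptions the curve starts at the peak rate, passes through half of it at the midpoint, and decays to the floor at the final epoch. In a PyTorch 2.4.1 environment such as the one listed above, `torch.optim.AdamW` combined with `torch.optim.lr_scheduler.CosineAnnealingLR` would be the usual way to obtain the same shape.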