Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment
Authors: Qi Xu, Dongxu Wei, Lingzhe Zhao, Wenpu Li, Zhangchi Huang, Shunping Ji, Peidong Liu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs. ... Quantitative Results. As shown in Table 1, our approach outperforms all baselines across all tasks by a clear margin. ... Ablation Studies |
| Researcher Affiliation | Academia | 1Wuhan University 2Westlake University 3Westlake Institute for Advanced Study 4Zhejiang University |
| Pseudocode | Yes | Algorithm 1 Pixel-aligned 2D-to-3D lifting for simultaneous understanding and 3D reconstruction. |
| Open Source Code | Yes | 5. Open access to data and code... Answer: [Yes] Justification: We have included our code and its running instructions in our supplementary material. |
| Open Datasets | Yes | We utilize ScanNet[17] for training and validation, the largest public dataset that concurrently provides multi-view images with dense semantic/instance segmentation labels and text-referred segmentation labels[56]. ... [17] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432-2443, July 2017. |
| Dataset Splits | Yes | We adopt the official training and validation dataset splitting of ScanNet, and then resize and crop original images to centered images at 256 × 256 resolution. ... The same IoU-based sampling strategy is also adopted in our evaluation, where we select 1,860 context image pairs to formulate the validation set. |
| Hardware Specification | Yes | We conduct training on 8 NVIDIA GeForce RTX 4090 GPUs, with our model trained for 100 epochs using a per-GPU batch size of 3 (total batch size of 24) for about 2 hours. ... training devices: 8 × RTX 4090 |
| Software Dependencies | No | The paper mentions several models and optimizers like AdamW[57], CLIP text encoder[53], Vision Transformer, DPT head[54], MaskFormer[38], and Mask2Former[39], but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We conduct training on 8 NVIDIA GeForce RTX 4090 GPUs, with our model trained for 100 epochs using a per-GPU batch size of 3 (total batch size of 24) for about 2 hours. AdamW optimizer[57] is employed with an initial learning rate of 1e-4 followed by cosine decay scheduling. ... Table I (b): Hyperparameters. Loss weights (λ1, λ2, λ3, λ4, λ5): 1.0, 0.5, 0.05, 0.05, 1. Training details: learning rate scheduler: cosine; epochs: 100; learning rate: 1e-4; batch size on each device: 3; optimizer: AdamW[57]; beta1, beta2: 0.9, 0.95; weight decay: 0.05; warm-up epochs: 3; gradient clip: 1.0. |
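
The learning-rate schedule and batch-size arithmetic quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code. The base learning rate, epoch count, warm-up length, GPU count, and per-GPU batch size are taken from the row above. The linear warm-up shape, the decay to zero, the per-epoch stepping, and the function name `lr_at_epoch` are assumptions made for illustration.

```python
import math

# Values reported in the Experiment Setup row.
BASE_LR = 1e-4
EPOCHS = 100
WARMUP_EPOCHS = 3
NUM_GPUS = 8
PER_GPU_BATCH = 3


def lr_at_epoch(epoch: int) -> float:
    """Learning rate for a 0-indexed epoch.

    Assumptions (not stated in the source): warm-up is linear, the cosine
    curve decays toward zero, and the rate is updated once per epoch.
    """
    if epoch < WARMUP_EPOCHS:
        # Linear ramp that reaches BASE_LR on the last warm-up epoch.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Fraction of the post-warm-up schedule already completed, in [0, 1).
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))


# Effective batch size across all devices: 8 GPUs x 3 samples each.
total_batch_size = NUM_GPUS * PER_GPU_BATCH

if __name__ == "__main__":
    print("total batch size:", total_batch_size)
    for e in (0, 2, 3, 50, 99):
        print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.3e}")
```

The reported weight decay (0.05), betas (0.9, 0.95), and gradient clip (1.0) would be passed to the optimizer and training loop of whichever framework is used. The table notes that the paper does not specify framework versions, so none is assumed here.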