Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals

Authors: Markus Gross, Aya Fahmy, Danit Niwattananan, Dominik Muhle, Rui Song, Daniel Cremers, Henri Meeß

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experimental results show that our approach achieves state-of-the-art in-domain performance, exhibits superior zero-shot generalization on out-of-domain data, and achieves a runtime reduction exceeding 14×. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion.
Researcher Affiliation Collaboration ¹Fraunhofer Institute IVI, ²Technical University of Munich, ³Munich Center for Machine Learning, ⁴University of California, Los Angeles
Pseudocode No The paper describes the methodology using textual explanations and mathematical equations (e.g., Eq. 1-7) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code available at https://github.com/markus-42/ipformer.
Open Datasets Yes We conduct our experiments by (i) in-domain training and testing on the SemanticKITTI SSC dataset [1], and (ii) out-of-domain zero-shot generalization on the distinct SSCBench-KITTI360 [31].
Dataset Splits Yes The dataset comprises 10 training sequences, 1 validation sequence, and 11 test sequences, with our experiments adhering to the standard split [44] of 3834 training and 815 validation grids.
Hardware Specification Yes We utilize a single NVIDIA A100 80GB GPU, adopt a maximum learning rate of 1 × 10⁻⁴, and implement a cosine adaptive learning rate schedule decay, with a cosine warmup applied over the initial 2 epochs. Our implementation is based on PyTorch [39] with an fp32 backend. Moreover, we operate on a 50% voxel grid resolution of X = 128, Y = 128, Z = 16 and finally upsample to the ground-truth grid resolution of 256 × 256 × 32 via trilinear interpolation. The feature dimension is set to C = 128. Training IPFormer takes approximately 3.5 days for each of the two stages. The second stage training is initialized with the final model state of the first stage, and we eventually present results for the best checkpoint based on PQ. Aligning with [32, 21, 59], we adopt a pretrained MobileStereoNet [46] to estimate depth maps, and employ EfficientNetB7 [49] as our image backbone, consistent with [63, 59]. Moreover, the context net consists of a lightweight CNN, while the panoptic head represents a single linear layer for projection to class logits. The deformable cross and self-attention blocks during proposal initialization consist of three layers and two layers, respectively, while 8 points are sampled for each reference point. Finally, the cross and self-attention blocks during decoding each consist of three layers.
Software Dependencies No Our implementation is based on PyTorch [39] with an fp32 backend. This mentions PyTorch but does not specify a version number, which is required for a reproducible software dependency description.
Experiment Setup Yes In accordance with [4, 20, 32, 21], we train for 25 epochs in the first stage and 30 epochs in the second stage, using AdamW [36] optimizer with standard hyperparameters β1 = 0.9, β2 = 0.99, and a batch size of 1. We utilize a single NVIDIA A100 80GB GPU, adopt a maximum learning rate of 1 × 10⁻⁴, and implement a cosine adaptive learning rate schedule decay, with a cosine warmup applied over the initial 2 epochs.
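
The training settings quoted in the table above can be collected into a small configuration sketch. The following is a minimal, standard-library-only illustration and is not taken from the authors' repository: the names `TrainConfig`, `lr_at`, and `voxel_count` are hypothetical, and the schedule makes two assumptions the quoted text does not state, namely that the warmup starts from a learning rate of 0 and that the cosine decay ends at 0 within each stage.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the table; the field names are illustrative."""
    epochs_stage1: int = 25
    epochs_stage2: int = 30
    batch_size: int = 1
    max_lr: float = 1e-4
    betas: tuple = (0.9, 0.99)           # AdamW beta1, beta2
    warmup_epochs: float = 2.0
    train_grid: tuple = (128, 128, 16)   # 50% of the ground-truth resolution per axis
    target_grid: tuple = (256, 256, 32)  # ground-truth resolution


def lr_at(epoch: float, total_epochs: int, cfg: TrainConfig, min_lr: float = 0.0) -> float:
    """Cosine warmup followed by cosine decay.

    ASSUMPTION: both the warmup start and the decay end equal `min_lr`;
    the quoted text gives only the maximum learning rate and warmup length.
    """
    span = cfg.max_lr - min_lr
    if epoch < cfg.warmup_epochs:
        # Half-cosine ramp from min_lr up to max_lr.
        phase = epoch / cfg.warmup_epochs
        return min_lr + span * 0.5 * (1.0 - math.cos(math.pi * phase))
    # Half-cosine decay from max_lr down to min_lr over the remaining epochs.
    phase = (epoch - cfg.warmup_epochs) / (total_epochs - cfg.warmup_epochs)
    phase = min(phase, 1.0)
    return min_lr + span * 0.5 * (1.0 + math.cos(math.pi * phase))


def voxel_count(grid: tuple) -> int:
    """Number of voxels in an (X, Y, Z) grid."""
    x, y, z = grid
    return x * y * z
```

Under these assumptions, `lr_at(2.0, 25, TrainConfig())` returns the peak rate of 1e-4 at the end of warmup. The grid helper makes the cost of the resolution choice explicit: halving each axis means the model operates on one eighth of the ground-truth voxels before trilinear upsampling.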