Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting

Authors: Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, Tiancheng Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on both static and streaming-based dynamic scene reconstruction, while maintaining interactive runtimes.
Researcher Affiliation | Collaboration | Yiming Wang (ETH Zurich), Lucy Chai (Google), Xuan Luo (Google), Michael Niemeyer (Google), Manuel Lagunas (Google), Stephen Lombardi (Google), Siyu Tang (ETH Zurich), Tiancheng Sun (Google)
Pseudocode | No | The paper describes the methods in detailed paragraphs and uses diagrams (e.g., Figure 1, Figure 5) to illustrate the workflow, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Answer: [No] Justification: The code is not publicly available at the time of submission. However, the datasets used in the experiments are publicly accessible, and sufficient implementation details are provided to support reproducibility. The authors are also willing to assist with any reproduction-related issues.
Open Datasets | Yes | We benchmark our method on two widely used datasets, Real Estate10K [87] and DL3DV [35], which cover both indoor scenes and unbounded large-scale environments.
Dataset Splits | Yes | For the Real Estate10K dataset, we set the training and testing resolution to 256 × 256. ... Each training batch consists of two input views and six target views with a baseline of one unit length between the input views, following the training setup of GS-LRM. For evaluation, we use the same input and target indices as pixelSplat and GS-LRM. For the DL3DV dataset, we set the training and testing resolution to 384 × 216. For training, we randomly select one image in the scene as the target, and randomly select four of the nearest eight cameras to the target as inputs. ... For evaluation, we use every eighth image as the target set, and for each target we use the nearest four cameras not in the target set as inputs.
Hardware Specification | Yes | Our approach achieves state-of-the-art performance in both static and streaming scene reconstructions while running at interactive rates (15 fps with 350 ms delay) on a single H100 GPU. Our framework is implemented in JAX [2] and trained on NVIDIA A100 GPUs.
Software Dependencies | No | Our framework is implemented in JAX [2] and trained on NVIDIA A100 GPUs. ... We use the Adam optimizer [24] ... We use TAPIR [11] as our 2D tracking backbone ... We use the official splatting-based rasterizer from 3D Gaussian Splatting [22] ... We use the open-source 3DGStream [63] implementation ... We use the open-source 4DGS [74] implementation ... We enhance GS-LRM's speed by replacing the CNN deconvolution with an MLP unpatchify layer as proposed in [21], and implementing it in JAX.
Experiment Setup | Yes | We use the Adam optimizer [24] with an initial learning rate of 4e-4, applying cosine learning rate decay with linear warmup. The warmup period is set to 5000 training steps. Our network consists of a Multi-view Transformer backbone and a Sparse Voxel Transformer. The multi-view transformer has 24 transformer layers with 1024 hidden dimensions and 16 attention heads. The sparse voxel transformer uses 6 layers with 128 hidden dimensions and 8 attention heads. We train our full-scale model for cross-dataset generalization on the DL3DV dataset [35] with a batch size of 128 for a total of 300K iterations using a two-stage training strategy. In the first stage, we train the multi-view Transformer backbone for 200K iterations, followed by joint fine-tuning of both the multi-view and voxel Transformers for an additional 100K iterations.
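
Illustrative sketch (not the authors' code): since no code is released, the snippet below shows one way the learning-rate schedule and two-stage split quoted under Experiment Setup could be written down. The peak rate, warmup length, and iteration counts come from the quoted text. Everything else is an assumption made for illustration: that warmup starts from zero, that the cosine decay ends at zero, that a single schedule spans all 300K iterations rather than restarting in stage two, and the module names `multi_view_transformer` and `sparse_voxel_transformer`.

```python
import math

PEAK_LR = 4e-4          # learning rate quoted in the Experiment Setup row
WARMUP_STEPS = 5_000    # warmup length quoted in the Experiment Setup row
TOTAL_STEPS = 300_000   # total iterations quoted in the Experiment Setup row
STAGE1_STEPS = 200_000  # backbone-only iterations quoted in the same row
FINAL_LR = 0.0          # assumption: the decay floor is not stated


def learning_rate(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def trainable_modules(step: int) -> tuple:
    """Hypothetical module names; which parts are trained at a given step."""
    if step < STAGE1_STEPS:
        return ("multi_view_transformer",)
    return ("multi_view_transformer", "sparse_voxel_transformer")


if __name__ == "__main__":
    for s in (0, 2_500, 5_000, 200_000, 300_000):
        print(s, f"{learning_rate(s):.2e}", trainable_modules(s))
```

In a JAX setup the same curve is commonly built from a library schedule instead of by hand; whether the authors did so is not stated in the quoted text.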