Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis
Authors: Xiaoyuan Wang, Yizhou Zhao, Botao Ye, Xiaojun Shan, Weijie Lyu, Lu Qi, Kelvin Chan, Yinxiao Li, Ming-Hsuan Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental evaluation demonstrates that HoliGS significantly outperforms state-of-the-art methods in terms of both rendering quality and computational speed, achieving real-time rendering capabilities on consumer hardware. Our results confirm robust performance across diverse, challenging, dynamic sequences featuring multiple interacting entities and complex articulated motions, scenarios where prior techniques either fail or produce substantial visual artifacts. |
| Researcher Affiliation | Collaboration | Xiaoyuan Wang¹, Yizhou Zhao¹, Botao Ye², Xiaojun Shan³, Weijie Lyu⁴, Lu Qi⁵, Kelvin C.K. Chan⁶, Yinxiao Li⁶, Ming-Hsuan Yang⁴,⁶. ¹CMU, ²ETH Zurich, ³UC San Diego, ⁴UC Merced, ⁵Insta360, ⁶Google DeepMind |
| Pseudocode | No | The paper describes the methods textually and with mathematical equations, as well as a pipeline diagram (Figure 3), but it does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The source code will be released. |
| Open Datasets | Yes | Quantitative results for novel view synthesis are reported in Tables 2 and 3. Table 2 presents visual metrics across the Total-Recon dataset, while Table 3 reports depth accuracy metrics (Acc@0.1m and RMS depth error). Our method consistently outperforms the baselines in both sets of metrics. |
| Dataset Splits | Yes | Our experiments are conducted on a newly captured dataset comprising 11 sequences recorded with a stereo camera setup at 30fps, featuring diverse scenes with complex interactions between humans and animals. Each sequence is approximately 0.5-1 minutes long, containing between 400 and 900 frames. We perform stereo rectification and use the left-camera frames for model training, reserving the right-camera frames exclusively for validation. |
| Hardware Specification | Yes | On NVIDIA H20 GPUs, each pre-training or refinement stage completes in about 30 minutes, enabling full scenes (including multiple deformable objects) to converge in two hours, significantly faster than other approaches. |
| Software Dependencies | No | The paper mentions implementing the model using PyTorch and optimizing with Adam but does not provide specific version numbers for PyTorch or any other libraries or tools used for the implementation. |
| Experiment Setup | Yes | We adopt a two-phase procedure to optimize our dynamic Gaussian representation: component pre-training and joint refinement. During pre-training, each component (e.g., a deformable object or the static background) is optimized separately. Once pre-training is completed, all components are combined for joint refinement using color, depth, normal, and mask objectives. Training follows standard Gaussian Splatting protocols [2]. The synergy between our deformation-centric design and the parametric Gaussian framework accelerates convergence considerably. On NVIDIA H20 GPUs, each pre-training or refinement stage completes in about 30 minutes, enabling full scenes (including multiple deformable objects) to converge in two hours, significantly faster than other approaches. Component pre-training. We initialize the deformation network by minimizing the overall loss (3), with default weights set as: $\lambda_{\mathrm{depth}} = 5$ (or 1.5 for the HUMAN 1 sequence), $\lambda_{\mathrm{color}} = 0.1$, $\lambda_{\mathrm{flow}} = 1$, $\lambda_{\mathrm{cycle}} = 1$, and $\lambda_{\mathrm{segment}} = 1$. This eikonal term is weighted by $\lambda_{\mathrm{SDF}} = 0.001$ to ensure proper geometric properties. For this computation, we sample 17 uniformly distributed points $X^t_i$ along each camera ray $r^t$ centered at the surface point derived from back-projecting the ground-truth depth. Joint fine-tuning. During the joint optimization phase, we simultaneously refine all object representations by minimizing loss (5) for an additional 6,000 iterations. The default weights for these objectives are $\lambda_{\mathrm{photo}} = 1$, $\lambda_{\mathrm{normal}} = 1$, $\lambda_{\mathrm{depth}} = 5$, and $\lambda_{\mathrm{seg},j} = 1$. By default, we freeze the background's appearance and geometry parameters while allowing optimization of its global transformation $T^b_0$, the foreground objects' transformations $T^f_t$, and the foreground appearance and geometry parameters (for HUMAN 1, we use $\lambda_{\mathrm{depth}} = 1.5$ and allow background appearance and geometry optimization during joint fine-tuning). This joint fine-tuning phase significantly enhances the visual coherence of foreground elements and improves the modeling of inter-object interactions. |
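
The loss weights quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration, not the authors' code (which has not been released): the class names, the `weights_for` helper, and the `weighted_total` function are hypothetical, and the assumption that losses (3) and (5) are plain weighted sums of their terms is not confirmed by the excerpt above. Only the numeric values and the HUMAN 1 override come from the quoted text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PretrainWeights:
    """Default weights quoted for component pre-training (loss (3))."""
    depth: float = 5.0
    color: float = 0.1
    flow: float = 1.0
    cycle: float = 1.0
    segment: float = 1.0
    sdf: float = 0.001  # weight of the eikonal term


@dataclass(frozen=True)
class JointWeights:
    """Default weights quoted for joint fine-tuning (loss (5))."""
    photo: float = 1.0
    normal: float = 1.0
    depth: float = 5.0
    seg: float = 1.0


def weights_for(sequence: str):
    """Return (pre-training, joint) weights for a sequence (hypothetical helper)."""
    pre, joint = PretrainWeights(), JointWeights()
    if sequence == "HUMAN 1":
        # The quoted text lowers the depth weight to 1.5 for this sequence.
        pre = replace(pre, depth=1.5)
        joint = replace(joint, depth=1.5)
    return pre, joint


def weighted_total(weights, terms: dict) -> float:
    """Assumed form of the objective: sum of weight * loss term by name."""
    return sum(getattr(weights, name) * value for name, value in terms.items())
```

For example, `weights_for("HUMAN 1")` returns both weight sets with the depth weight lowered to 1.5, while any other sequence name keeps the defaults.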