Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

Authors: Peng Wang, Xiang Liu, Peidong Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on both in-domain and out-of-domain datasets. It outperforms prior approaches in terms of both multi-view consistency and efficiency, which producing high-quality 3D stylized content in only 0.15 second. [...] Datasets. We use a combination of RealEstate10K (RE10K) [56] and DL3DV [22] as our scene dataset [...] To evaluate zero-shot generalization, we test on the Tanks and Temples [16] dataset [...] Evaluation Metrics. Because of the novel and under-explored nature of 3D stylization, there are few metrics for assessing the quality of the stylization. Therefore, we evaluate the multi-view consistency as in prior 3D stylization works [8, 23, 24]. We estimate optical flow between sequential images using RAFT [41], then warp the earlier frame with softmax splatting [29]. Consistency is measured by LPIPS [55] and RMSE between the warped and target images over valid pixels. Short- and long-range consistency are computed between adjacent views and those seven frames apart, respectively. We further employed ArtFID [46], a metric well aligned with human perceptual judgment by jointly assessing content preservation and style fidelity, together with the RGB-uv histogram from HistoGAN [1] to comprehensively evaluate the quality of color transfer. To evaluate novel view synthesis quality, we report standard image similarity metrics: PSNR, SSIM, and LPIPS [55]. [...] 4.2 Ablation Studies
Researcher Affiliation | Academia | Peng Wang (1, 2), Xiang Liu (2), Peidong Liu (2); 1: Zhejiang University, 2: Westlake University
Pseudocode | No | The paper describes the model architecture and training process in sections 3.1, 3.2, and 3.3, and illustrates the overall pipeline in Figure 2. It also provides a loss function as Equation 2. However, it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code will not be submitted along with paper, but we may release the code later on Github.
Open Datasets | Yes | Datasets. We use a combination of RealEstate10K (RE10K) [56] and DL3DV [22] as our scene dataset, covering both indoor and outdoor videos with diverse camera motion patterns. For style supervision, we use WikiArt [32], and assign a unique style image to each scene in the training and evaluation sets. This setup ensures that neither the test scenes nor styles were seen during training. To evaluate zero-shot generalization, we test on the Tanks and Temples [16] dataset which is widely used by prior 3D style transfer methods [8, 23, 24, 54]. [...] more out-of-domain stylization results on the Tanks and Temples and NeRF LLFF [26] datasets;
Dataset Splits | No | Datasets. We use a combination of RealEstate10K (RE10K) [56] and DL3DV [22] as our scene dataset, covering both indoor and outdoor videos with diverse camera motion patterns. For style supervision, we use WikiArt [32], and assign a unique style image to each scene in the training and evaluation sets. This setup ensures that neither the test scenes nor styles were seen during training. To evaluate zero-shot generalization, we test on the Tanks and Temples [16] dataset which is widely used by prior 3D style transfer methods [8, 23, 24, 54]. [...] Progressive Multi-view Training. To stabilize multi-view training, we first pre-train the model on the 2-view setting for the NVS task, which is then used to initialize the 4-view NVS training and subsequent stylization fine-tuning. Though trained with 4 input views, our model can flexibly handle 2 to 8 views during inference as shown in Fig. 8.
Hardware Specification | Yes | Training takes ~1.5 days on 8 NVIDIA A100 GPUs.
Software Dependencies | No | Implementation details. We use PyTorch. The content and style encoder adopts a standard ViT-Large architecture with a patch size 16, while the structure and stylization decoder is based on a ViT-Base model. [...] To expedite the inference of network, we use the flash attention implementation from xFormers [18] in all of our encoders and decoders.
Experiment Setup | Yes | Implementation details. We use PyTorch. The content and style encoder adopts a standard ViT-Large architecture with a patch size 16, while the structure and stylization decoder is based on a ViT-Base model. We initialize the encoder, decoder, and the Gaussian center prediction head with pretrained weights from MASt3R [19], whereas the remaining layers are initialized randomly. The model is trained on images with a resolution of 256 × 256. Besides, we use 0 degree spherical harmonics for Gaussians following [8]. [...] A More Implementation Details. Training. In terms of optimization, we employ AdamW optimizer. For Novel View Synthesis (NVS) pretraining, we train the stylization decoder, color head and structure head with initial learning rate of 2 × 10^-4, and fine-tune the other parameters with 2 × 10^-5. Then for stylization fine-tuning, we continue optimizing the color head and stylization decoder with initial learning rate of 2 × 10^-4 and fine-tune only the style encoder with 2 × 10^-5, and keep all the other parameters in the structure branch fixed.
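
Illustrative sketch (not from the paper): the two-stage learning-rate schedule quoted in the Experiment Setup row can be written down as a small lookup. This is a minimal sketch under stated assumptions, not the authors' code, which has not been released. The component names (`content_encoder`, `style_encoder`, and so on) are hypothetical identifiers chosen for illustration, and treating the style encoder as one of the "other parameters" during NVS pretraining is an assumption, since the quoted text does not list them.

```python
# Learning rates quoted in the Experiment Setup row.
BASE_LR = 2e-4      # rate for the parts trained at the "initial learning rate"
FINETUNE_LR = 2e-5  # rate for the parts that are only fine-tuned

# Hypothetical component names (assumption): the paper describes these parts
# but gives no code identifiers for them.
COMPONENTS = [
    "content_encoder",
    "style_encoder",
    "structure_decoder",
    "stylization_decoder",
    "structure_head",
    "color_head",
]


def learning_rates(stage, components=COMPONENTS):
    """Map each component to its learning rate; None marks a frozen component."""
    if stage == "nvs_pretrain":
        fast = {"stylization_decoder", "color_head", "structure_head"}
        # Assumption: every component not in `fast` counts as "other parameters".
        return {c: BASE_LR if c in fast else FINETUNE_LR for c in components}
    if stage == "stylization_finetune":
        fast = {"color_head", "stylization_decoder"}
        slow = {"style_encoder"}
        rates = {}
        for c in components:
            if c in fast:
                rates[c] = BASE_LR
            elif c in slow:
                rates[c] = FINETUNE_LR
            else:
                rates[c] = None  # kept fixed during stylization fine-tuning
        return rates
    raise ValueError(f"unknown stage: {stage!r}")


def to_param_groups(rates, params_by_component):
    """Build a list of {"params", "lr"} dicts, skipping frozen components."""
    return [
        {"params": params_by_component[name], "lr": lr}
        for name, lr in rates.items()
        if lr is not None
    ]


if __name__ == "__main__":
    # Empty lists stand in for real parameter tensors.
    placeholder_params = {name: [] for name in COMPONENTS}
    for stage in ("nvs_pretrain", "stylization_finetune"):
        groups = to_param_groups(learning_rates(stage), placeholder_params)
        print(stage, [g["lr"] for g in groups])
```

The list returned by `to_param_groups` has the per-parameter-group shape that `torch.optim.AdamW` accepts as its first argument, so with real parameter lists in place of the placeholders it could be passed to the optimizer directly.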