Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergei Korolev, Sergey Tulyakov, Peter Wonka
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the 4D video generation capability across a combination of datasets: 1) Generated videos. ... 2) Objaverse. ... 3) Nvidia Dynamic Dataset. ... For all datasets, we prepared fixed-view videos as reference inputs and freeze-time videos to condition the view points for generation. ... On datasets with ground truth frames (Objaverse and NVIDIA Dynamic), we evaluate reconstruction quality using standard metrics: PSNR, SSIM, and LPIPS. For the generated video dataset, we subsample outputs into fixed-view and freeze-time videos. We assess multi-view consistency using Met3R [91], and evaluate visual quality of the fixed-view videos using the widely adopted VBench [90] metrics. |
| Researcher Affiliation | Collaboration | Chaoyang Wang1 Ashkan Mirzaei1 Vidit Goel1 Willi Menapace1 Aliaksandr Siarohin1 Avalon Vinella1 Michael Vasilkovsky1 Ivan Skorokhodov1 Vladislav Shakhrai1 Sergey Korolev1 Sergey Tulyakov1 Peter Wonka1,2 1Snap Inc. 2KAUST |
| Pseudocode | No | The paper describes methods in paragraph text and uses mathematical equations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Justification: code to be released upon internal approval. |
| Open Datasets | Yes | Evaluation datasets. We evaluate the 4D video generation capability across a combination of datasets: 1) Generated videos. ... 2) Objaverse. Following SV4D, we collect 19 animated 3D assets from Objaverse [45]... 3) Nvidia Dynamic Dataset. This dataset [85] contains 9 dynamic scenes... The training data includes: 1) Synthetic multi-view videos, rendered using animated 3D assets [45] and physics-based simulations [84]. 2) 2D transformed videos... 3) 3D videos... The first training stage uses both synthetic and real-world static datasets. We include RealEstate10K [54], DL3DV [87], MVImageNet [88], Kubric [84] (only single-timestep samples), and ACID [89]. |
| Dataset Splits | Yes | Finally, in Table S6 we present a quantitative diagnosis of the full pipeline on the Nvidia Dynamic Scene dataset, holding out views 1, 4, 7, and 10 at every timestep for evaluation. ... In this experiment, we use four views as input for each scene and the rest as targets. |
| Hardware Specification | Yes | We train the model on 48 A100 GPUs with a batch size of 96... Profiling was performed on an A100 GPU using float32 precision... The training is done on 32 A100 GPUs for around a day. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | All models are trained under identical settings for 4,000 iterations with a batch size of 96. ... Training Setup. We train the model on 48 A100 GPUs with a batch size of 96, using sequences of 8 views and 29 frames. The learning rate is set to 1e-4 with a warm-up schedule. The model converges quickly and begins producing plausible results after approximately 2000 iterations. We switch to fine-tuning the model on sequences with 8 views and 61 frames at 4000 iterations and the finetuning continues for an additional 2000 iterations. ... Static and dynamic training use batch sizes of 14 and 1, respectively, and learning rates of 0.0002 and 0.00002. We sample uniformly across datasets in both stages. Static training runs for 20K iterations, and dynamic training runs for 15K iterations. |
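
The two-phase sequence-length schedule quoted in the Experiment Setup row can be written down as a small lookup. This is only a sketch built from the quoted numbers, not the authors' training code. The phase names, the `phase_at` helper, and the 0-based iteration convention are hypothetical. The learning rate is read as 1e-4, and the shape of the warm-up is not given in the source, so it is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One training phase as quoted in the Experiment Setup row."""
    name: str          # hypothetical label, not from the paper
    num_views: int
    num_frames: int
    iterations: int


# Numbers taken from the quoted setup; names are illustrative only.
SCHEDULE = (
    Phase("base", num_views=8, num_frames=29, iterations=4000),
    Phase("long_finetune", num_views=8, num_frames=61, iterations=2000),
)

BATCH_SIZE = 96
LEARNING_RATE = 1e-4  # assumed to be the value reached after warm-up


def phase_at(iteration: int) -> Phase:
    """Return the phase active at a 0-based iteration index (assumed convention)."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    start = 0
    for phase in SCHEDULE:
        if iteration < start + phase.iterations:
            return phase
        start += phase.iterations
    raise ValueError(f"iteration {iteration} is past the end of the schedule ({start})")
```

Under these assumptions, `phase_at(3999).num_frames` gives 29 and `phase_at(4000).num_frames` gives 61, matching the reported switch to 61-frame sequences at 4000 iterations.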