Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Dynamic View Synthesis as an Inverse Problem

Authors: Hidir Yesiltepe, Pinar Yanardag

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase. Our framework produces visually coherent results under various viewpoints and demonstrates strong temporal alignment with the source footage. For video samples, please refer to the Supplementary Material.
Researcher Affiliation Academia Hidir Yesiltepe Pinar Yanardag EMAIL EMAIL Virginia Tech
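Pseudocode Yes Algorithm 1 Stochastic Latent Modulation
Input: $x \in \mathbb{R}^{B \times F \times C \times H \times W}$, $\epsilon \in \mathbb{R}^{B \times F \times C \times H \times W}$, $M \in \{0, 1\}^{B \times F \times C \times H \times W}$, $D \in \mathbb{R}^{B \times F \times C \times H \times W}$
Output: Modulated $x$, modulated $\epsilon$
Compute visibility mask $V = (1 - M) \odot D$
Let $I_{\text{source}} = \{i \mid V_i = 1\}$
Let $I_{\text{target}} = \{i \mid M_i = 1\}$
for each $i \in I_{\text{target}}$ do
    Sample $j \sim \text{Uniform}(I_{\text{source}})$
    Set $\epsilon_i = \epsilon_j$
    Set $x_i = x_j$
end for
return $x$, $\epsilon$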
Open Source Code No Answer: [Yes] Justification: We will make the code public. People are welcome to validate our results.
Open Datasets Yes We construct a dataset of 1100 videos to evaluate performance across varying content and motion complexity: 1000 from OpenVid-1M [41], 50 from DAVIS [43], and 50 AI-generated videos.
Dataset Splits Yes We construct a dataset of 1100 videos to evaluate performance across varying content and motion complexity: 1000 from OpenVid-1M [41], 50 from DAVIS [43], and 50 AI-generated videos. OpenVid-1M provides semantically rich scenes, DAVIS offers high-motion content for testing temporal stability, and AI-generated samples assess generalization to synthetic inputs. Each video is rendered under 10 canonical camera trajectories, including translations, pans, tilts, and arcs, to evaluate robustness under diverse viewpoint shifts. Quantitative comparison of visual quality, camera pose accuracy, and view synchronization on 1000 randomly selected samples from the OpenVid-1M [41] dataset.
Hardware Specification Yes The output resolution is fixed at 480 × 720, and all experiments are conducted on a single NVIDIA L40 GPU.
Software Dependencies No Our framework is built on the pretrained CogVideoX-5B-I2V model. Inference is performed with 50 steps at a strength of 0.95 to ensure a T > 0. For all quantitative evaluations, we set the classifier-free guidance (CFG) scale to 6.0 and use a recursion order of k = 10 and adaptive order of δ = 3. 3D dynamic point clouds are generated using DepthCrafter [21], following the procedure described in [65].
Experiment Setup Yes Inference is performed with 50 steps at a strength of 0.95 to ensure a T > 0. For all quantitative evaluations, we set the classifier-free guidance (CFG) scale to 6.0 and use a recursion order of k = 10 and adaptive order of δ = 3. 3D dynamic point clouds are generated using DepthCrafter [21], following the procedure described in [65]. We apply DDIM inversion with a positive terminal-SNR noise schedule using 30 steps, and adopt v-prediction in all cases. For quantitative evaluations, we use CogVideoX's modified DDIM sampling method in the reverse trajectory.
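
The Stochastic Latent Modulation pseudocode quoted in the Pseudocode row (Algorithm 1) can be sketched in NumPy as follows. This is an illustrative reading of the extracted pseudocode, not the authors' released code. It assumes the visibility mask is the element-wise product (1 - M) * D, that D is a binary mask, and that indices run over all elements of the flattened tensor. The function name, the `rng` argument, and the early return for empty index sets are hypothetical additions.

```python
import numpy as np


def stochastic_latent_modulation(x, eps, M, D, rng=None):
    """Fill masked positions of x and eps with values from random visible positions.

    x, eps, M, D share the shape (B, F, C, H, W). M marks target positions;
    D is assumed here to be a binary mask (assumption, not stated in the source).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Visibility mask, assumed to be the element-wise product (1 - M) * D.
    V = (1 - M) * D

    # Flat indices of source (visible) and target (masked) elements.
    source = np.flatnonzero(V == 1)
    target = np.flatnonzero(M == 1)

    x_flat = x.reshape(-1)
    eps_flat = eps.reshape(-1)
    x_out = x_flat.copy()
    eps_out = eps_flat.copy()

    # Hypothetical guard: with nothing to copy from or to, return copies unchanged.
    if source.size == 0 or target.size == 0:
        return x_out.reshape(x.shape), eps_out.reshape(eps.shape)

    # One uniform draw j per target i; the same j is used for both x and eps.
    picks = rng.choice(source, size=target.size, replace=True)
    x_out[target] = x_flat[picks]
    eps_out[target] = eps_flat[picks]

    return x_out.reshape(x.shape), eps_out.reshape(eps.shape)
```

Drawing all indices in one vectorized call is equivalent to the per-element loop in the pseudocode because source and target sets are disjoint, so no copied value is read after being overwritten. Whether the actual method samples across the whole tensor or within each frame is not specified in the quoted text.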