Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Taming generative video models for zero-shot optical flow extraction

Authors: Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L Yamins

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Without any flow-specific fine-tuning, our method is competitive with state-of-the-art, task-specific models on the real-world TAP-Vid DAVIS benchmark and the synthetic TAP-Vid Kubric. Our results show that counterfactual prompting of controllable generative video models is an effective alternative to supervised or photometric-loss methods for high-quality flow.
Researcher Affiliation	Academia	Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L.K. Yamins; Stanford University
Pseudocode	No	The paper describes the methods in text and uses figures to illustrate processes (e.g., Figure 2 for test-time inference procedure, Figure 3 for KL-tracing), but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: All data used is already available openly, code will be released.
Open Datasets	Yes	We use TAP-Vid DAVIS [9] and Kubric [13] for evaluation. TAP-Vid DAVIS contains real-world videos with human-annotated flow and occlusion labels, while Kubric is a synthetic dataset with ground-truth labels. Most models [39, 41] are supervised on synthetic video datasets [12, 28, 6], e.g., Flying Chairs [12], Flying Things [28] and Sintel [6].
Dataset Splits	Yes	Model: Endpoint Error (EPE) on TAP-Vid DAVIS Subset (3%). LRAS RGB (5MM, 8MS, 2STD) (ours): 8.4797; LRAS KL (5MM, 8MS, 2STD) (ours): 5.0762; Stable Video Diffusion [5]: 74.7990; Cosmos (top 10% raster) (5MM, 2STD, 512×512) [30]: 35.4338; Cosmos (overwrite 10% during rollout) (5MM, 2STD, 512×512) [30]: 37.7552; Cosmos (provide full second frame) (5MM, 2STD, 512×512) [30]: 66.5521
Hardware Specification	No	We also thank the Stanford HAI, Stanford Data Sciences, the Marlowe team, and the Google TPU Research Cloud team for their computing support.
Software Dependencies	No	The paper does not provide specific software names with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup	Yes	Step 1. Inject a small perturbation. We duplicate the initial frame F1 and perturb the copy with a small white bump, i.e., a Gaussian centered at the query location xq with amplitude 255 on each RGB channel and standard deviation σ equal to 2.0. ... Step 3. Estimate optical flow with patchwise KL-divergence. ... For every patch (i, j) we compute the KL-divergence: $D_{\mathrm{KL}}(i, j) = \mathrm{KL}\left(z_{ij} \,\|\, z^{\mathrm{pert}}_{ij}\right)$. ... Model: Endpoint Error (EPE) on TAP-Vid DAVIS Subset (3%). LRAS RGB (5MM, 8MS, 2STD) (ours): 8.4797; LRAS KL (5MM, 8MS, 2STD) (ours): 5.0762
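
The two quoted steps in the Experiment Setup row (Gaussian perturbation and patchwise KL-divergence) can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function names `add_gaussian_bump` and `patchwise_kl` are hypothetical, adding the bump to the frame and clipping to [0, 255] is an assumption, and so is treating each patch's prediction as a categorical distribution obtained by a softmax over logits. The video model that would produce those logits is not included.

```python
import numpy as np


def add_gaussian_bump(frame, xq, yq, amplitude=255.0, sigma=2.0):
    """Return a copy of `frame` (H, W, 3, uint8) with a white Gaussian bump.

    The bump is centered at pixel (xq, yq) and added to every RGB channel.
    Additive blending with clipping is an assumption of this sketch.
    """
    h, w, _ = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist_sq = (xs - xq) ** 2 + (ys - yq) ** 2
    bump = amplitude * np.exp(-dist_sq / (2.0 * sigma ** 2))
    out = frame.astype(np.float64) + bump[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)


def _log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def patchwise_kl(logits_clean, logits_pert):
    """Compute KL(z_ij || z_ij^pert) for every patch (i, j).

    Both inputs have shape (H_p, W_p, V): one vector of V logits per patch
    (the logit layout is an assumption). Returns an (H_p, W_p) array.
    """
    log_p = _log_softmax(logits_clean)
    log_q = _log_softmax(logits_pert)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
```

As a usage sketch, `add_gaussian_bump(frame, xq, yq)` would produce the perturbed copy of the first frame, and `patchwise_kl` would be applied to the model's next-frame predictions with and without the perturbation. Reading the patch with the largest divergence as the location the query point moved to is an interpretation of "Estimate optical flow with patchwise KL-divergence", not something the quoted text states in full.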