Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Authors: Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results show that our TED method outperforms 17 popular self-supervised models, achieving state-of-the-art performance in pixel-level object tracking. On the widely-used DAVIS-2017 benchmark [35], our TED significantly outperforms recent self-supervised methods [23, 45, 36, 31] by up to 6%.
Researcher Affiliation Academia Chenshuang Zhang (1), Kang Zhang (1), Joon Son Chung (1), In So Kweon (1), Junmo Kim (1), Chengzhi Mao (2). (1) KAIST, (2) Rutgers University
Pseudocode Yes Algorithm 1: Temporal Enhanced Diffusion Tracking (TED) Input: Video frames I_1, I_2, ..., I_N; Ground-truth label Y_1 for I_1; Video diffusion model UNet_v; Image diffusion model UNet_i; Denoising steps: τ_v (video diffusion) and τ_i (image diffusion); Block index for features: n_v (video diffusion) and n_i (image diffusion); Fusion weight λ. Output: Label predictions Y_2, Y_3, ..., Y_N for frames I_2, ..., I_N.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will release the code and data.
Open Datasets Yes On the widely-used DAVIS-2017 benchmark [35], our TED significantly outperforms recent self-supervised methods [23, 45, 36, 31] by up to 6%. When evaluated on videos that include multiple similar-looking objects, our TED method achieves even larger improvement by up to 10%. Visualizations confirm that our representations encode differently for similar looking objects with different motion. Our approach also achieves significant improvement in other challenging scenarios, such as appearance-identical objects, real-world viewpoint changes, and object deformations. Project page: https://chenshuang-zhang.github.io/projects/ted. ... We introduce Youtube-Similar, including 28 videos featuring similar-looking objects from Youtube-VOS [56], totally 839 frames and 69 objects. ... To eliminate these variations, we introduce Kubric-Similar, including 30 videos (480 frames, 60 objects) in which two identical-looking balls move independently. The dataset is generated by Kubric simulator [18], with random ball colors, sizes, and motions.
Dataset Splits Yes Standard benchmark. We follow previous work [32, 27, 52, 24, 36, 31] and evaluate on the widely-used DAVIS-2017 validation set [35], which contains 30 videos (2023 frames, 59 objects).
Hardware Specification Yes We compare computation cost with prior methods in Table 5, tested on a single A100 GPU using DAVIS videos.
Software Dependencies No The paper does not explicitly state specific software versions (e.g., Python, PyTorch, CUDA versions) used in the experiments.
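Experiment Setup Yes Table 4: Experimental setups of TED for video label propagation. DAVIS: video diffusion model I2VGen-XL (timestep 300, block 3); image diffusion model ADM (timestep 51, block 8); fusion weight 0.4; softmax temp 0.2; radius 15; k for top-k 10. Youtube-Similar: video diffusion model I2VGen-XL (timestep 600, block 3); image diffusion model ADM (timestep 51, block 8); fusion weight 0.6; softmax temp 0.1; radius 15; k for top-k 10. Kubric-Similar: video diffusion model I2VGen-XL (timestep 900, block 3); image diffusion model ADM (timestep 51, block 8); fusion weight 1.0; softmax temp 0.1; radius 15; k for top-k 10.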
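For readers who want the Table 4 settings in machine-readable form, the following is a minimal sketch, not the authors' code. It assumes the ten values in each row map, in order, to video model, video timestep, video block, image model, image timestep, image block, fusion weight, softmax temperature, radius, and k for top-k. The names `TedSetup`, `TABLE4`, and `propagate_label` are hypothetical. The helper illustrates a generic top-k softmax label vote for one query pixel, as commonly used in video label propagation; how the fusion weight and the radius enter the pipeline is not specified in this summary, so they are stored but not applied.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class TedSetup:
    """Hypothetical container for one row of Table 4."""
    video_model: str
    video_timestep: int
    video_block: int
    image_model: str
    image_timestep: int
    image_block: int
    fusion_weight: float
    softmax_temp: float
    radius: int
    top_k: int


# Values transcribed from Table 4; the column order is an assumption.
TABLE4 = {
    "DAVIS": TedSetup("I2VGen-XL", 300, 3, "ADM", 51, 8, 0.4, 0.2, 15, 10),
    "Youtube-Similar": TedSetup("I2VGen-XL", 600, 3, "ADM", 51, 8, 0.6, 0.1, 15, 10),
    "Kubric-Similar": TedSetup("I2VGen-XL", 900, 3, "ADM", 51, 8, 1.0, 0.1, 15, 10),
}


def propagate_label(similarities, labels, setup):
    """Generic top-k softmax vote for a single query pixel (illustrative only).

    similarities: similarity of the query to each reference pixel.
    labels: one label distribution (list of floats) per reference pixel.
    Returns the weighted average label distribution for the query.
    """
    k = min(setup.top_k, len(similarities))
    # Keep the k most similar reference pixels.
    top = sorted(range(len(similarities)),
                 key=lambda i: similarities[i], reverse=True)[:k]
    # Temperature-scaled softmax over the kept similarities (max-shifted for stability).
    peak = max(similarities[i] for i in top)
    weights = [math.exp((similarities[i] - peak) / setup.softmax_temp) for i in top]
    total = sum(weights)
    n_classes = len(labels[0])
    return [sum(w * labels[i][c] for w, i in zip(weights, top)) / total
            for c in range(n_classes)]
```

A call such as `propagate_label([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], TABLE4["DAVIS"])` returns a distribution dominated by the first class, because the low temperature sharpens the vote toward the most similar reference pixel.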