Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Tracking and Understanding Object Transformations
Authors: Yihong Sun, Xinyu Yang, Jennifer J. Sun, Bharath Hariharan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments. Datasets. VOST [41] is curated from ego-centric videos in Ego4D [14] and EPIC-Kitchens [8] that contain object transformations from actor-object interactions. The validation set contains 70 videos with an average of 22.3 seconds captured at 60 fps, with 114 object masklets annotated at 5 fps. VSCOS [53] is constructed in a similar fashion from EPIC-Kitchens [8]. Its validation set contains 98 videos with an average of 7.5 seconds captured at 60 fps and object mask annotated at 1 fps. M3-VOS [5] models object phase changes and contains limited camera motion due to its source from online videos. The entire dataset serves as evaluation, containing 479 videos, 526 masklets, with an average of 14.3 seconds captured at 30 fps. Also, we evaluate on DAVIS 2017 [32] to confirm tracking performance for objects that are not undergoing transformations. VOST-TAS (Track Any State): We introduce a new benchmark for evaluating the proposed task by manually annotating transformations in the VOST [41] val set. Each object instance includes a list of transformations with temporal boundaries (start/end frames), action verb descriptions, and a list of resulting objects with segmentation masks and text descriptions on the end frame per transformation. In total, it contains 57 video instances, 108 transformations, and 293 annotated resulting objects. Implementation Details. For Tubelet Graph, we adopt SAM2.1-L [35], CropFormer-Hornet3X [33], FC-CLIP-COCO [54]. Hyperparameters for all three models are kept as default and not tuned further. In addition, we adopt GPT-4.1 [1] and keep sampling temperature at 0. To reason about new candidate entities, we select τprox = 0.3 and τsem = 0.7 after sweeping intervals of 0.1 on VOST train split that is similar sized as VOST val and applied to other datasets without any further modification. In addition, we arbitrarily ignore any entities smaller than 1/252 of the video frame and set the coverage threshold for initiating new tracks τcoverage = 0.25 without further tuning. 4.1 Object Tracking. To measure object tracking performance, we follow VOST [41] and report Jaccard J and Jtr (only over last 25% frames), along with per-pixel precision P and recall R. For a more fine-grain analysis, we divide each dataset into three equal subsets: small (S), medium (M), and large (L), based on the average object size throughout the video. |
| Researcher Affiliation | Academia | Yihong Sun Cornell University Xinyu Yang Cornell University Jennifer J. Sun Cornell University Bharath Hariharan Cornell University |
| Pseudocode | No | The paper describes the overview of the proposed Tubelet Graph in Section 3.1 and 3.2, including how it partitions the video, finds missing objects, and understands transformations, accompanied by Figure 2, which is an overview diagram. However, it does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io. |
| Open Datasets | Yes | Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io. Datasets. VOST [41] is curated from ego-centric videos in Ego4D [14] and EPIC-Kitchens [8] that contain object transformations from actor-object interactions. VOST-TAS (Track Any State): We introduce a new benchmark for evaluating the proposed task by manually annotating transformations in the VOST [41] val set. |
| Dataset Splits | Yes | VOST [41] validation set. VOST-TAS (Track Any State): We introduce a new benchmark for evaluating the proposed task by manually annotating transformations in the VOST [41] val set. To measure object tracking performance, we follow VOST [41] and report Jaccard J and Jtr (only over last 25% frames), along with per-pixel precision P and recall R. For a more fine-grain analysis, we divide each dataset into three equal subsets: small (S), medium (M), and large (L), based on the average object size throughout the video. |
| Hardware Specification | Yes | The main efficiency bottleneck of Tubelet Graph is constructing a spatiotemporal partition by tracking every spatial region, which costs on average 7 seconds per frame on VOST [41] with one NVIDIA RTX A6000 GPU. |
| Software Dependencies | Yes | For Tubelet Graph, we adopt SAM2.1-L [35], CropFormer-Hornet3X [33], FC-CLIP-COCO [54]. Hyperparameters for all three models are kept as default and not tuned further. In addition, we adopt GPT-4.1 [1] and keep sampling temperature at 0. |
| Experiment Setup | Yes | For Tubelet Graph, we adopt SAM2.1-L [35], CropFormer-Hornet3X [33], FC-CLIP-COCO [54]. Hyperparameters for all three models are kept as default and not tuned further. In addition, we adopt GPT-4.1 [1] and keep sampling temperature at 0. To reason about new candidate entities, we select τprox = 0.3 and τsem = 0.7 after sweeping intervals of 0.1 on VOST train split that is similar sized as VOST val and applied to other datasets without any further modification. In addition, we arbitrarily ignore any entities smaller than 1/252 of the video frame and set the coverage threshold for initiating new tracks τcoverage = 0.25 without further tuning. |
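
The tracking metrics quoted in the table (Jaccard J, Jtr over the last 25% of frames, per-pixel precision P and recall R, and the small/medium/large split by average object size) can be illustrated with a minimal Python sketch. This is not the VOST evaluation code or the authors' implementation: the function names, the nested-list mask representation, the per-frame averaging, and the convention of scoring a ratio with a zero denominator as 1.0 are all assumptions made for illustration.

```python
from math import ceil
from statistics import mean


def mask_scores(pred, gt):
    """Return (jaccard, precision, recall) for one pair of 2D binary masks."""
    pairs = [(bool(p), bool(g))
             for pred_row, gt_row in zip(pred, gt)
             for p, g in zip(pred_row, gt_row)]
    inter = sum(p and g for p, g in pairs)   # pixels in both masks
    union = sum(p or g for p, g in pairs)    # pixels in either mask
    n_pred = sum(p for p, _ in pairs)
    n_gt = sum(g for _, g in pairs)
    # Assumption: a ratio with a zero denominator is scored as 1.0.
    jaccard = inter / union if union else 1.0
    precision = inter / n_pred if n_pred else 1.0
    recall = inter / n_gt if n_gt else 1.0
    return jaccard, precision, recall


def sequence_scores(preds, gts, tail_fraction=0.25):
    """Average per-frame scores; J_tr uses only the last `tail_fraction` of frames."""
    per_frame = [mask_scores(p, g) for p, g in zip(preds, gts)]
    n_tail = max(1, ceil(len(per_frame) * tail_fraction))
    return {
        "J": mean(s[0] for s in per_frame),
        "J_tr": mean(s[0] for s in per_frame[-n_tail:]),
        "P": mean(s[1] for s in per_frame),
        "R": mean(s[2] for s in per_frame),
    }


def split_by_size(avg_sizes):
    """Split sequence indices into three near-equal groups by average object size."""
    order = sorted(range(len(avg_sizes)), key=lambda i: avg_sizes[i])
    n = len(order)
    bounds = [i * n // 3 for i in range(4)]
    return {name: order[bounds[k]:bounds[k + 1]]
            for k, name in enumerate(("S", "M", "L"))}


# Toy example: four 2x2 frames; the last prediction misses one object pixel.
gt_mask = [[1, 1], [0, 0]]
partial = [[1, 0], [0, 0]]
scores = sequence_scores([gt_mask, gt_mask, gt_mask, partial], [gt_mask] * 4)
groups = split_by_size([0.5, 0.1, 0.3, 0.9, 0.2, 0.7])
```

In this toy run only the final frame is imperfect, so the tail-only score falls below the full-sequence score. This is the kind of late-video degradation that a metric restricted to the last 25% of frames is meant to expose.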