Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Track-On: Transformer-based Online Point Tracking with Memory
Authors: Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. |
| Researcher Affiliation | Academia | Görkay Aydemir¹, Xiongyi Cai³, Weidi Xie³, Fatma Güney¹,² — ¹Department of Computer Engineering, Koç University; ²KUIS AI Center; ³School of Artificial Intelligence, Shanghai Jiao Tong University |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures, but no explicitly labeled pseudocode or algorithm blocks are present. |
| Open Source Code | No | Project page: https://kuis-ai.github.io/track_on |
| Open Datasets | Yes | We use TAP-Vid (Doersch et al., 2022) for both training and evaluation, consistent with previous work. Specifically, we train our model on TAP-Vid Kubric, a synthetic dataset of 11k video sequences, each with a fixed length of 24 frames. For evaluation, we use three other datasets from the TAP-Vid benchmark: TAP-Vid DAVIS, which includes 30 real-world videos from the DAVIS dataset; TAP-Vid RGB-Stacking, a synthetic dataset of 50 videos focused on robotic manipulation tasks, mainly involving textureless objects; TAP-Vid Kinetics, a collection of over 1,000 real-world online videos. We provide comparisons on four additional datasets in Appendix Sec. C. |
| Dataset Splits | Yes | We use TAP-Vid (Doersch et al., 2022) for both training and evaluation, consistent with previous work. Specifically, we train our model on TAP-Vid Kubric, a synthetic dataset of 11k video sequences, each with a fixed length of 24 frames. For evaluation, we use three other datasets from the TAP-Vid benchmark: TAP-Vid DAVIS, which includes 30 real-world videos from the DAVIS dataset; TAP-Vid RGB-Stacking, a synthetic dataset of 50 videos focused on robotic manipulation tasks, mainly involving textureless objects; TAP-Vid Kinetics, a collection of over 1,000 real-world online videos. We follow the standard protocol of the TAP-Vid benchmark by first downsampling the videos to 256×256. |
| Hardware Specification | Yes | The model is optimized using the AdamW optimizer (Loshchilov & Hutter, 2019) on 32 A100 64GB GPUs, with mixed precision. The results are based on tracking approximately 400 points on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions models like DINOv2 and ViT-Adapter, and optimizers like AdamW, but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | We train our model for 150 epochs, equivalent to approximately 50K iterations, using a batch size of 32. The model is optimized using the AdamW optimizer (Loshchilov & Hutter, 2019) on 32 A100 64GB GPUs, with mixed precision. The learning rate is set to a maximum of 5×10⁻⁴, following a cosine decay schedule with a linear warmup period covering 5% of the total training time. A weight decay of 1×10⁻⁵ is applied, and gradient norms are clipped at 1.0 to ensure stable training. Input frames are resized to 384×512 using bilinear interpolation before processing. Each training sample includes up to N = 480 points. We apply random key masking with a 0.1 ratio during attention calculations for memory read operations throughout training. For the training loss coefficients, we set λ to 3. During training, we clip the offset loss to the stride S to prevent large errors from incorrect patch classifications and stabilize the loss. Deep supervision is applied to the offset head (Φoff), and the average loss across layers is used. We set the softmax temperature τ to 0.05 in patch classification. We set the visibility threshold to 0.8 for all datasets except RGB-Stacking, where it is set to 0.5 due to its domain-specific characteristics, consisting of simple, synthetic videos. |
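The learning-rate schedule quoted above (linear warmup over the first 5% of steps, then cosine decay from a peak of 5×10⁻⁴) can be sketched as a small standalone function. This is an illustrative reconstruction from the quoted hyperparameters, not code from the Track-On repository; the function name and the exact warmup endpoint behavior are assumptions.

```python
import math

def lr_at_step(step, total_steps=50_000, max_lr=5e-4, warmup_frac=0.05):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumed schedule shape based on the paper's description; the paper
    does not specify whether decay ends exactly at zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr over the first 5% of training.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

With the paper's numbers (≈50K iterations), warmup spans the first 2,500 steps; the peak rate 5×10⁻⁴ is reached at the end of warmup and decays smoothly thereafter.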