DINTR: Tracking via Diffusion-based Interpolation

Authors: Pha Nguyen, Ngan Le, Jackson Cothren, Alper Yilmaz, Khoa Luu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experimental Results
Researcher Affiliation | Academia | ¹University of Arkansas, ²Ohio State University; ¹{panguyen, thile, jcothre, khoaluu}@uark.edu, ²yilmaz.15@osu.edu
Pseudocode | Yes | Algorithm 1: Inplace Reconstruction Finetuning
Open Source Code | No | The techniques presented in this work are the intellectual property of [Affiliation], and the organization intends to seek patent coverage for the disclosed process.
Open Datasets | Yes | TAP-Vid [18] formalizes the problem of long-term physical Point Tracking. It contains 31,951 points tracked on 1,219 real videos.
Dataset Splits | No | The paper mentions fine-tuning on datasets such as TAP-Vid and PoseTrack and evaluation on their respective benchmarks, but it does not explicitly specify the training/validation/test splits used (e.g., percentages or sample counts per split).
Hardware Specification | Yes | The model is trained on 4 NVIDIA Tesla A100 GPUs with a batch size of 1, comprising a pair of frames.
Software Dependencies | No | The paper mentions building on 'LDM [13] and ADM [111]' but does not provide specific version numbers for these or for other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | The model is then fine-tuned using our proposed strategy for 500 steps with a learning rate of 3e-5. The model is trained on 4 NVIDIA Tesla A100 GPUs with a batch size of 1, comprising a pair of frames. We average the attentions A_S and A_X over the interval k ∈ [0, 0.8T] of the DDIM steps, with total timestep T = 50. For the first-frame initialization, we employ YOLOX [112] as the detector, HRNet [113] as the pose estimator, and Mask2Former [114] as the segmentation model. We maintained a linear noise scheduler across all experiments...
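The Experiment Setup and Hardware Specification rows above are the only quantitative training details the paper reports. As a convenience, here is a minimal sketch that collects those numbers into a single configuration object; the class and field names are hypothetical (the paper releases no code), and only the values are taken from the rows above.

```python
from dataclasses import dataclass


@dataclass
class DINTRFinetuneConfig:
    """Hypothetical container for the setup reported in the table above.

    Field names are illustrative; only the values come from the
    'Experiment Setup' and 'Hardware Specification' rows.
    """
    finetune_steps: int = 500            # "fine-tuned ... for 500 steps"
    learning_rate: float = 3e-5          # "learning rate of 3e-5"
    batch_size: int = 1                  # one pair of frames per batch
    num_gpus: int = 4                    # 4x NVIDIA Tesla A100
    ddim_total_timesteps: int = 50       # total timestep T = 50
    attn_avg_fraction: float = 0.8       # average A_S, A_X over k in [0, 0.8*T]
    noise_schedule: str = "linear"       # linear noise scheduler
    detector: str = "YOLOX"              # first-frame initialization
    pose_estimator: str = "HRNet"
    segmentation_model: str = "Mask2Former"

    def attention_average_steps(self) -> range:
        """DDIM step indices k over which A_S and A_X are averaged.

        Assumes the interval [0, 0.8*T] is inclusive of its endpoint;
        the paper excerpt does not state this explicitly.
        """
        upper = int(self.attn_avg_fraction * self.ddim_total_timesteps)
        return range(0, upper + 1)


cfg = DINTRFinetuneConfig()
print(list(cfg.attention_average_steps()))  # k = 0 .. 40 for T = 50
```

With T = 50 and a 0.8 fraction, the attention maps would be averaged over roughly the first 40 DDIM steps; the exact endpoint handling is an assumption, as noted in the docstring.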