TAPTRv2: Attention-based Position Update Improves Tracking Any Point

Authors: Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Feng Li, Bohan Li, Tianhe Ren, Lei Zhang

NeurIPS 2024

Reproducibility assessment. Each entry gives the variable, the result, and the supporting LLM response:
Research Type: Experimental
"We conduct extensive experiments on multiple challenging evaluation datasets collected from the real world to verify the superiority of TAPTRv2. Detailed ablation studies of our main contribution are also provided to show the effectiveness of each design in our modeling."
Researcher Affiliation: Collaboration
Hongyang Li (1,2), Hao Zhang (2,3), Shilong Liu (2,4), Zhaoyang Zeng (2), Feng Li (2,3), Tianhe Ren (2), Bohan Li (5), Lei Zhang (1,2). Affiliations: 1. South China University of Technology; 2. International Digital Economy Academy (IDEA); 3. The Hong Kong University of Science and Technology; 4. Dept. of CST, BNRist Center, Institute for AI, Tsinghua University; 5. Shanghai Jiao Tong University.
Pseudocode: No
The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No
"All the data are openly available and can be derived from the papers we cite in Sec. 4, and our code will be made available after the double-blind review process."
Open Datasets: Yes
"Following previous works [26, 20, 14, 9], we train TAPTRv2 on the Kubric dataset, which consists of 11,000 synthetic videos generated by the Kubric engine [12]. We evaluate our method on the challenging TAP-Vid-DAVIS [34] and TAP-Vid-Kinetics [5] datasets."
Dataset Splits: No
The paper describes training on Kubric and evaluating on TAP-Vid-DAVIS and TAP-Vid-Kinetics, but it does not specify explicit train/validation/test splits (percentages or counts) for any of these datasets, nor how a validation set is derived from the training data.
Hardware Specification: Yes
"We use 8 NVIDIA A100 GPUs, accumulating gradients 4 times to approximate a total batch size of 32, and train TAPTRv2 for approximately 44,000 iterations. [...] We evaluate TAPTR and TAPTRv2 on an A100 GPU (80 GB), and the computational cost (GFLOPs) is calculated following detectron2."
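
The accumulation arithmetic implied here is 8 GPUs × 4 accumulation steps × a per-GPU micro-batch of 1 = an effective batch size of 32. Below is a minimal PyTorch sketch of that scheme, using a toy stand-in model since the TAPTRv2 code is unreleased; none of these identifiers come from the authors' implementation:

```python
import torch
from torch import nn

# Toy stand-ins: TAPTRv2's real model and data pipeline are not public,
# so this only illustrates the gradient-accumulation arithmetic.
model = nn.Linear(256, 2)                                   # hypothetical point head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # paper trains with AdamW
loss_fn = nn.L1Loss()

ACCUM_STEPS = 4  # gradients accumulated 4 times, per the paper
# 8 GPUs x 4 accumulation steps x per-GPU batch 1 = total batch size 32.

optimizer.zero_grad()
for step in range(8):                                # a few toy micro-batches
    x, target = torch.randn(1, 256), torch.randn(1, 2)
    loss = loss_fn(model(x), target) / ACCUM_STEPS   # scale so summed grads match one large batch
    loss.backward()                                  # gradients accumulate in .grad
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                             # one optimizer update per 4 micro-batches
        optimizer.zero_grad()
```

As for the GFLOPs figure, detectron2's flop counting builds on fvcore, so a count in the same spirit is `FlopCountAnalysis(model, example_inputs).total() / 1e9` from `fvcore.nn` (here `example_inputs` is a placeholder for a real input tensor).
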
Software Dependencies: No
The paper mentions the AdamW optimizer, EMA, and a ResNet-50 backbone, but does not specify software dependencies with version numbers, such as Python, PyTorch, or CUDA.
Experiment Setup: Yes
"We follow the previous work [26] and use ResNet-50 as the image backbone for both experimental efficiency and fair comparison. We employ two Transformer encoder layers with deformable attention [57] to enhance feature quality, and five Transformer decoder layers by default to achieve fully optimized results. We use AdamW [58] and EMA [21] for training. We use 8 NVIDIA A100 GPUs, accumulating gradients 4 times to approximate a total batch size of 32, and train TAPTRv2 for approximately 44,000 iterations."
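
For reference, the quoted setup can be collected into a single configuration sketch. The field names below are our own shorthand, not identifiers from the authors' (unreleased) codebase:

```python
# Illustrative summary of the quoted training setup; all field names
# are ours, not from the TAPTRv2 code, which was unreleased at review time.
TAPTRV2_SETUP = {
    "backbone": "resnet50",      # ResNet-50 image backbone [26]
    "encoder_layers": 2,         # Transformer encoder layers with deformable attention [57]
    "decoder_layers": 5,         # default decoder depth
    "optimizer": "adamw",        # AdamW [58]
    "ema": True,                 # EMA of model weights [21]
    "gpus": 8,                   # NVIDIA A100
    "grad_accum_steps": 4,       # approximates a total batch size of 32
    "total_batch_size": 32,
    "train_iterations": 44_000,  # approximate
}
```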