Unifying Visual and Vision-Language Tracking via Contrastive Learning

Authors: Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experimental results demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. |
| Researcher Affiliation | Academia | Yinchao Ma¹, Yuyang Tang¹, Wenfei Yang¹, Tianzhu Zhang¹*, Jinpeng Zhang², Mengxue Kang²; ¹Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China; ²Intelligent Science Technology Academy of CASIC |
| Pseudocode | No | The paper describes its methods mathematically and in prose, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Codes and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack. |
| Open Datasets | Yes | We train our model on the training splits of LaSOT (Fan et al. 2019), GOT-10k (Huang et al. 2019), COCO2017 (Lin et al. 2014), TrackingNet (Muller et al. 2018), TNL2K (Wang et al. 2021b), OTB99 (Li et al. 2017), and RefCOCOg-google (Mao et al. 2016). |
| Dataset Splits | No | The paper mentions 'training splits' and 'test sets' but does not explicitly specify a separate validation set (e.g., percentages, counts, or use of a predefined validation split). |
| Hardware Specification | Yes | The experiments are conducted on a server with eight 24GB NVIDIA RTX 3090 GPUs. |
| Software Dependencies | Yes | Our tracker is implemented using Python 3.8.13 and PyTorch 1.10.1. |
| Experiment Setup | Yes | We crop the template and search region by 2² and 4² times the target bounding box area and resize them to 128×128 and 256×256, respectively. The test image for first-frame grounding is scaled so that its long edge is 256. The image patch size is p=16. For language, the max sentence length N_l is set to 40. (...) The number of encoder layers is set to N=6, M=6 for UVLTrack-B and N=12, M=12 for UVLTrack-L. (...) The loss weights are set to λ_giou=2.0, λ_1=5.0, λ_mmc=0.1. |
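The quoted experiment setup fully determines the input resolutions and loss weighting, so a small worked sketch can make the arithmetic concrete. The snippet below is illustrative only and not the authors' code: it assumes square crops whose area is 2² (template) and 4² (search region) times the target box area, and assumes the reported weights combine the loss terms as a simple weighted sum; the helper names (`crop_side`, `total_loss`, `LOSS_WEIGHTS`) are hypothetical.

```python
import math

# Hypothetical constants mirroring the quoted setup; not the authors' implementation.
TEMPLATE_FACTOR, TEMPLATE_SIZE = 2 ** 2, 128   # template crop area factor, resized to 128x128
SEARCH_FACTOR, SEARCH_SIZE = 4 ** 2, 256       # search crop area factor, resized to 256x256
PATCH_SIZE = 16                                # ViT patch size p
MAX_TEXT_LEN = 40                              # max sentence length N_l
LOSS_WEIGHTS = {"giou": 2.0, "l1": 5.0, "mmc": 0.1}  # lambda_giou, lambda_1, lambda_mmc


def crop_side(box_w: float, box_h: float, area_factor: float) -> float:
    """Side length of a square crop whose area is `area_factor` times the target box area."""
    return math.sqrt(area_factor * box_w * box_h)


def total_loss(losses: dict) -> float:
    """Weighted sum of the individual loss terms (assumed combination, see lead-in)."""
    return sum(LOSS_WEIGHTS[k] * losses[k] for k in LOSS_WEIGHTS)


if __name__ == "__main__":
    # Example: a 100x50 target box -> search crop side = sqrt(16 * 100 * 50) ~ 283 px,
    # which is then resized to 256x256 before patch embedding.
    side = crop_side(100, 50, SEARCH_FACTOR)
    print(f"search crop side ~= {side:.1f} px, resized to {SEARCH_SIZE}x{SEARCH_SIZE}")
    print(f"tokens per search image: {(SEARCH_SIZE // PATCH_SIZE) ** 2}")  # 256 patches
    print("toy total loss:", total_loss({"giou": 0.4, "l1": 0.1, "mmc": 0.7}))
```

With p=16, the 256×256 search region yields 16×16 = 256 visual tokens and the 128×128 template yields 64, which is consistent with the standard ViT patch-embedding arithmetic assumed here.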