Unifying Visual and Vision-Language Tracking via Contrastive Learning

Authors: Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experimental results demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. |
| Researcher Affiliation | Academia | Yinchao Ma¹, Yuyang Tang¹, Wenfei Yang¹, Tianzhu Zhang¹*, Jinpeng Zhang², Mengxue Kang²; ¹Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China; ²Intelligent Science Technology Academy of CASIC |
| Pseudocode | No | The paper describes its methods mathematically and in prose, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Codes and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack. |
| Open Datasets | Yes | We train our model on the training splits of LaSOT (Fan et al. 2019), GOT-10k (Huang et al. 2019), COCO2017 (Lin et al. 2014), TrackingNet (Muller et al. 2018), TNL2K (Wang et al. 2021b), OTB99 (Li et al. 2017), and RefCOCOg-google (Mao et al. 2016). |
| Dataset Splits | No | The paper mentions 'training splits' and 'test sets' but does not explicitly specify a separate validation set (e.g., percentages, counts, or use of a predefined validation split). |
| Hardware Specification | Yes | The experiments are conducted on a server with eight 24GB NVIDIA RTX 3090 GPUs. |
| Software Dependencies | Yes | Our tracker is implemented using Python 3.8.13 and PyTorch 1.10.1. |
| Experiment Setup | Yes | We crop the template and search region by 2² and 4² times the target bounding box area and resize them to 128×128 and 256×256, respectively. The test image for first-frame grounding is scaled so that its long edge is 256. The image patch size is p=16. For language, the max sentence length N_l is set to 40. (...) The number of encoder layers is set to N=6, M=6 for UVLTrack-B and N=12, M=12 for UVLTrack-L. (...) The loss weights are set to λ_giou=2.0, λ_1=5.0, λ_mmc=0.1. |
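The quoted experiment setup fully determines the input resolutions and loss weighting, so a small worked sketch can make the arithmetic concrete. The snippet below is illustrative only and not the authors' code: it assumes square crops whose area is 2² (template) and 4² (search region) times the target box area, and assumes the reported weights combine the loss terms as a simple weighted sum; the helper names (`crop_side`, `total_loss`, `LOSS_WEIGHTS`) are hypothetical.

```python
import math

# Hypothetical constants mirroring the quoted setup; not the authors' implementation.
TEMPLATE_FACTOR, TEMPLATE_SIZE = 2 ** 2, 128   # template crop area factor, resized to 128x128
SEARCH_FACTOR, SEARCH_SIZE = 4 ** 2, 256       # search crop area factor, resized to 256x256
PATCH_SIZE = 16                                # ViT patch size p
MAX_TEXT_LEN = 40                              # max sentence length N_l
LOSS_WEIGHTS = {"giou": 2.0, "l1": 5.0, "mmc": 0.1}  # lambda_giou, lambda_1, lambda_mmc


def crop_side(box_w: float, box_h: float, area_factor: float) -> float:
    """Side length of a square crop whose area is `area_factor` times the target box area."""
    return math.sqrt(area_factor * box_w * box_h)


def total_loss(losses: dict) -> float:
    """Weighted sum of the individual loss terms (assumed combination, see lead-in)."""
    return sum(LOSS_WEIGHTS[k] * losses[k] for k in LOSS_WEIGHTS)


if __name__ == "__main__":
    # Example: a 100x50 target box -> search crop side = sqrt(16 * 100 * 50) ~ 283 px,
    # which is then resized to 256x256 before patch embedding.
    side = crop_side(100, 50, SEARCH_FACTOR)
    print(f"search crop side ~= {side:.1f} px, resized to {SEARCH_SIZE}x{SEARCH_SIZE}")
    print(f"tokens per search image: {(SEARCH_SIZE // PATCH_SIZE) ** 2}")  # 256 patches
    print("toy total loss:", total_loss({"giou": 0.4, "l1": 0.1, "mmc": 0.7}))
```

With p=16, the 256×256 search region yields 16×16 = 256 visual tokens and the 128×128 template yields 64, which is consistent with the standard ViT patch-embedding arithmetic assumed here.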