Explicit Visual Prompts for Visual Object Tracking

Authors: Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, Xianxian Li

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on six benchmarks (i.e., LaSOT, LaSOT_ext, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that our EVPTrack can achieve competitive performance at a real-time speed by effectively exploiting both spatio-temporal and multi-scale information.
Researcher Affiliation | Academia | Liangtao Shi (1,2), Bineng Zhong (1,2)*, Qihua Liang (1,2), Ning Li (1,2), Shengping Zhang (3), Xianxian Li (1,2). Affiliations: (1) Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; (2) Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University; (3) Harbin Institute of Technology.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
Open Datasets | Yes | EVPTrack is trained on the same datasets as mainstream trackers (Ye et al. 2022), including LaSOT (Fan et al. 2019), GOT-10k (Huang, Zhao, and Huang 2021), TrackingNet (Müller et al. 2018), and COCO (Lin et al. 2014).
Dataset Splits | No | The paper names its training datasets (e.g., GOT-10k) and describes some training strategies, but provides no validation splits (e.g., percentages, sample counts, or references to predefined splits), so the data partitioning cannot be reproduced from the paper alone.
Hardware Specification | Yes | Our trackers were trained on 4 NVIDIA A10 GPUs. During the inference phase, the trackers' speed was tested on a single NVIDIA RTX 2080 Ti.
Software Dependencies | Yes | Our methods are implemented with Python 3.8 and the PyTorch 1.10 framework.
Experiment Setup | Yes | Template size: 112x112 pixels; search region size: 224x224 pixels (EVPTrack-224). Template size: 192x192 pixels; search region size: 384x384 pixels (EVPTrack-384). The HiViT-Base model (Zhang et al. 2023) serves as the image-prompt encoder, with parameters initialized from MAE (He et al. 2022). The backbone learning rate is set to 1x10^-5, the weight decay to 1x10^-4, and the learning rate of all other parameters to 1x10^-4. Training runs for 150 epochs in total, with 60k search images per epoch; the learning rate decreases by a factor of 10 after 120 epochs. For EVPTrack-224, N and M are set to 8 and 4, respectively, with a batch size of 32 per GPU; trained on 4 GPUs, the total batch size is 128.
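
To make the training recipe in the Experiment Setup row concrete, below is a minimal PyTorch sketch of the optimizer and schedule it implies. It assumes an AdamW optimizer, a step learning-rate schedule, and a module whose backbone parameters are prefixed with "backbone"; these assumptions (and the function name build_optimizer) are illustrative and are not taken from the paper or the table above, so the actual EVPTrack repository may organize things differently.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# Values quoted in the Experiment Setup row above.
TEMPLATE_SIZE, SEARCH_SIZE = 112, 224   # EVPTrack-224 (192/384 for EVPTrack-384)
BACKBONE_LR = 1e-5                      # HiViT-Base backbone learning rate
OTHER_LR = 1e-4                         # learning rate for all other parameters
WEIGHT_DECAY = 1e-4
EPOCHS, LR_DROP_EPOCH = 150, 120        # lr cut by 10x after epoch 120
BATCH_PER_GPU, NUM_GPUS = 32, 4         # effective total batch size: 128

def build_optimizer(model: torch.nn.Module):
    # Two parameter groups: the pretrained backbone trains with a 10x
    # smaller learning rate than the remaining (newly initialized) modules.
    # The "backbone" name prefix is an assumption about the module layout.
    backbone_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (backbone_params if name.startswith("backbone") else other_params).append(param)
    optimizer = AdamW(
        [{"params": backbone_params, "lr": BACKBONE_LR},
         {"params": other_params, "lr": OTHER_LR}],
        weight_decay=WEIGHT_DECAY,
    )
    # Call scheduler.step() once per epoch so the 10x drop lands at epoch 120.
    scheduler = StepLR(optimizer, step_size=LR_DROP_EPOCH, gamma=0.1)
    return optimizer, scheduler

With scheduler.step() invoked at the end of every epoch, both parameter groups run at their initial rates for the first 120 epochs and at one tenth of them for the remaining 30; training on 4 GPUs with a per-GPU batch of 32 yields the total batch size of 128 quoted above.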