Explicit Visual Prompts for Visual Object Tracking
Authors: Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, Xianxian Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on six benchmarks (i.e., LaSOT, LaSOText, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that our EVPTrack can achieve competitive performance at a real-time speed by effectively exploiting both spatio-temporal and multi-scale information. |
| Researcher Affiliation | Academia | Liangtao Shi (1,2), Bineng Zhong (1,2)*, Qihua Liang (1,2), Ning Li (1,2), Shengping Zhang (3), Xianxian Li (1,2). (1) Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; (2) Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University; (3) Harbin Institute of Technology |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack. |
| Open Datasets | Yes | EVPTrack is trained on the same datasets as mainstream trackers (Ye et al. 2022), including LaSOT (Fan et al. 2019), GOT-10k (Huang, Zhao, and Huang 2021), TrackingNet (Müller et al. 2018), and COCO (Lin et al. 2014). |
| Dataset Splits | No | The paper mentions training on specific datasets like GOT-10k and states some training strategies, but does not provide specific validation dataset splits (e.g., percentages, sample counts, or explicit references to predefined validation splits) for reproducibility of data partitioning. |
| Hardware Specification | Yes | Our trackers were trained on 4 NVIDIA A10 GPUs. During the inference phase, the trackers were tested at speed on a single NVIDIA RTX2080Ti. |
| Software Dependencies | Yes | Our methods are implemented based on Python 3.8 and the PyTorch 1.10 framework. |
| Experiment Setup | Yes | Template size: 112x112 pixels; search region size: 224x224 pixels (EVPTrack-224). Template size: 192x192 pixels; search region size: 384x384 pixels (EVPTrack-384). We use the HiViT-Base (Zhang et al. 2023) model as the Image-Prompt Encoder and its parameters are initialized with MAE (He et al. 2022). The learning rate of the backbone is set to 1x10^-5, the learning rate decay is set to 1x10^-4, and the learning rate of the other parameters is set to 1x10^-4. Training runs for a total of 150 epochs, and each epoch uses 60k search images. The learning rate decreases by a factor after 120 epochs. For EVPTrack-224, we set N and M to 8 and 4, respectively, with a batch size of 32. EVPTrack-224 is trained on 4 GPUs, so the total batch size is 128. |
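The reported hyperparameters can be gathered into a single config sketch for reference. This is a minimal illustration only: the function name `make_config` and the dict layout are my own, not from the released EVPTrack code, and the `lr_decay` entry mirrors the paper's wording "learning rate decay" without interpreting it further.

```python
def make_config(variant="EVPTrack-224"):
    """Collect the training hyperparameters reported for EVPTrack.

    Values are taken from the paper's stated setup; the structure is
    purely illustrative (hypothetical helper, not the authors' code).
    """
    base = {
        "backbone_lr": 1e-5,          # learning rate of the HiViT-Base backbone
        "other_lr": 1e-4,             # learning rate of all other parameters
        "lr_decay": 1e-4,             # reported as "learning rate decay"
        "epochs": 150,
        "samples_per_epoch": 60_000,  # search images sampled per epoch
        "lr_drop_epoch": 120,         # LR decreases by a factor after this epoch
        "train_gpus": 4,              # NVIDIA A10 GPUs used for training
    }
    if variant == "EVPTrack-224":
        base.update(template_size=112, search_size=224,
                    N=8, M=4, batch_per_gpu=32)
    elif variant == "EVPTrack-384":
        base.update(template_size=192, search_size=384)
    else:
        raise ValueError(f"unknown variant: {variant}")
    # Total batch size = per-GPU batch x number of GPUs (128 for EVPTrack-224).
    if "batch_per_gpu" in base:
        base["total_batch"] = base["batch_per_gpu"] * base["train_gpus"]
    return base
```

For example, `make_config()["total_batch"]` evaluates to 128, matching the total batch size the paper reports for EVPTrack-224 on 4 GPUs.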