Diffusion Mask-Driven Visual-language Tracking

Authors: Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shuxiang Song

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOT_ext, and OTB-Lang), we validate that our proposed Diffusion Mask-Driven Visual-language Tracker can improve the robustness and effectiveness of the model."
Researcher Affiliation | Academia | "(1) Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China. (2) Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China. (3) Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Wuzhou 543002, China."
Pseudocode | No | The paper includes figures illustrating the framework and processes but no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOT_ext, and OTB-Lang)"
Dataset Splits | Yes | "We use the training splits of TNL2K [Xiao et al., 2021], LaSOT [Fan et al., 2018], OTB-Lang [Zhenyang et al., 2017b], and RefCOCOg-google [Junhua et al., 2016] for joint training." (See the joint-training sketch after this table.)
Hardware Specification | Yes | "Our model was implemented in the Pytorch framework on a server with 1 NVIDIA V100 GPU. ... We tested the proposed tracker on an NVIDIA 3080 GPU, and the single sample tracking speed is about 40 FPS." (See the FPS-timing sketch after this table.)
Software Dependencies | No | The paper mentions implementing in the "Pytorch framework" but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | "Our model is trained with 100 epochs, each epoch with 60,000 image pairs and each mini-batch with 64 sample pairs. We also train the model using the AdamW optimizer, set the weight decay to 10^-4, the initial learning rate of the backbone to 2 × 10^-5, and other parameters to 2 × 10^-4. After 80 epochs, the learning rate is decreased by a factor of 10." (See the optimizer sketch after this table.)
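
The joint-training setup quoted under Dataset Splits can be made concrete with a small PyTorch sketch. This is a minimal illustration under assumptions, not the authors' released code: the TrackingDataset placeholder and uniform random mixing across datasets are hypothetical, while the 60,000 image pairs per epoch and mini-batch size of 64 come from the quoted setup.

```python
# Sketch: joint training over multiple tracking datasets. The dataset
# classes and uniform sampling are illustrative assumptions, not the
# authors' implementation.
import random
from torch.utils.data import Dataset, DataLoader

class TrackingDataset(Dataset):
    """Placeholder for one benchmark's training split
    (e.g., TNL2K, LaSOT, OTB-Lang, or RefCOCOg-google)."""
    def __init__(self, name, size):
        self.name, self.size = name, size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # A real implementation would return a (template image,
        # search image, language description, box) sample pair.
        return {"source": self.name, "index": idx}

class JointDataset(Dataset):
    """Draws each sample from one of several datasets at random,
    so a single epoch mixes all training splits (joint training)."""
    def __init__(self, datasets, samples_per_epoch):
        self.datasets = datasets
        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch  # paper: 60,000 pairs per epoch

    def __getitem__(self, _):
        ds = random.choice(self.datasets)
        return ds[random.randrange(len(ds))]

datasets = [TrackingDataset(name, 10_000)
            for name in ("TNL2K", "LaSOT", "OTB-Lang", "RefCOCOg-google")]
loader = DataLoader(JointDataset(datasets, samples_per_epoch=60_000),
                    batch_size=64)  # paper: mini-batch of 64 sample pairs
```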
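The ~40 FPS figure quoted under Hardware Specification is typically obtained by timing repeated single-frame inference and dividing frames by elapsed time. The sketch below shows one common way to measure this; the stand-in model, warm-up count, and synchronization points are assumptions, not details from the paper.

```python
# Sketch: measuring single-sample tracking speed in FPS. The model is
# a placeholder; warm-up and CUDA synchronization are assumed details.
import time
import torch
from torch import nn

model = nn.Linear(256, 4).eval()             # stand-in for the tracker
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
frame = torch.randn(1, 256, device=device)   # one "frame" per step

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(frame)
    if device == "cuda":
        torch.cuda.synchronize()             # flush pending GPU work
    start = time.perf_counter()
    n_frames = 200
    for _ in range(n_frames):
        model(frame)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{n_frames / elapsed:.1f} FPS")
```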
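The quoted Experiment Setup maps directly onto a standard PyTorch optimizer configuration: AdamW with weight decay 10^-4, a lower learning rate for the backbone than for the other parameters, and a 10× step decay after epoch 80 of 100. The sketch below is a minimal reading of that description, assuming backbone parameters can be selected by the name "backbone"; the model itself is a placeholder.

```python
# Sketch: the quoted optimization schedule. Parameter grouping by the
# name "backbone" and the placeholder model are assumptions.
import torch
from torch import nn

model = nn.ModuleDict({
    "backbone": nn.Linear(256, 256),  # placeholder for the real backbone
    "head": nn.Linear(256, 4),        # placeholder for the other modules
})

backbone = [p for n, p in model.named_parameters() if "backbone" in n]
others = [p for n, p in model.named_parameters() if "backbone" not in n]

optimizer = torch.optim.AdamW(
    [{"params": backbone, "lr": 2e-5},   # backbone: 2 × 10^-5
     {"params": others, "lr": 2e-4}],    # others:   2 × 10^-4
    weight_decay=1e-4,
)
# Multiply every group's learning rate by 0.1 once epoch 80 is reached.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80], gamma=0.1)

for epoch in range(100):
    # ... one epoch over 60,000 image pairs in mini-batches of 64 ...
    scheduler.step()
```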