Diffusion Mask-Driven Visual-Language Tracking
Authors: Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shuxiang Song
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOT_ext, and OTB-Lang), we validate that our proposed Diffusion Mask-Driven Visual-Language Tracker can improve the robustness and effectiveness of the model. |
| Researcher Affiliation | Academia | (1) Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China. (2) Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China. (3) Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Wuzhou 543002, China. |
| Pseudocode | No | The paper includes figures illustrating the framework and processes but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOT_ext, and OTB-Lang) |
| Dataset Splits | Yes | We use the training splits of TNL2K [Xiao et al., 2021], LaSOT [Fan et al., 2018], OTB-Lang [Zhenyang et al., 2017b], and RefCOCOg-google [Junhua et al., 2016] for joint training. |
| Hardware Specification | Yes | Our model was implemented in the PyTorch framework on a server with 1 NVIDIA V100 GPU. ... We tested the proposed tracker on an NVIDIA 3080 GPU, and the single-sample tracking speed is about 40 FPS. |
| Software Dependencies | No | The paper mentions implementing in the "PyTorch framework" but does not specify a version number for PyTorch or for any other software dependency. |
| Experiment Setup | Yes | Our model is trained for 100 epochs, each epoch with 60,000 image pairs and each mini-batch with 64 sample pairs. We also train the model using the AdamW optimizer, set the weight decay to 10⁻⁴, the initial learning rate of the backbone to 2×10⁻⁵, and that of the other parameters to 2×10⁻⁴. After 80 epochs, the learning rate is decreased by a factor of 10. (A PyTorch sketch of this configuration follows the table.) |
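
The quoted hyperparameters map directly onto standard PyTorch parameter groups. Below is a minimal sketch of that optimizer and schedule, assuming the backbone parameters can be identified by a `backbone` name prefix; the authors' code is not public, so the grouping criterion and function name are illustrative, not their implementation.

```python
import torch
from torch import nn


def build_optimizer_and_scheduler(model: nn.Module):
    """AdamW with weight decay 1e-4, backbone lr 2e-5, other params 2e-4,
    and a 10x lr drop after epoch 80 of a 100-epoch run, as reported."""
    backbone_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Splitting on a "backbone" name prefix is an assumption made for
        # illustration; the paper does not describe its parameter grouping.
        if name.startswith("backbone"):
            backbone_params.append(param)
        else:
            other_params.append(param)

    optimizer = torch.optim.AdamW(
        [
            {"params": backbone_params, "lr": 2e-5},
            {"params": other_params, "lr": 2e-4},
        ],
        weight_decay=1e-4,
    )
    # MultiStepLR multiplies each group's lr by gamma at every milestone,
    # i.e. a factor-of-10 decrease after epoch 80.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[80], gamma=0.1
    )
    return optimizer, scheduler
```

Stepping the scheduler once per epoch (`scheduler.step()` after each of the 100 epochs) reproduces the stated decay point; stepping it per iteration would shift the milestone.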