SparseTT: Visual Tracking with Sparse Transformers

Authors: Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, Yunhong Wang

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that, without bells and whistles, our method significantly outperforms the state-of-the-art approaches on LaSOT, GOT-10k, TrackingNet, and UAV123, while running at 40 FPS.
Researcher Affiliation | Academia | Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai and Yunhong Wang; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; Hangzhou Innovation Institute, Beihang University. {fuzhihong, zehua fu, qingjie.liu, wenrui cai, yhwang}@buaa.edu.cn
Pseudocode | No | The paper describes mathematical formulas for its components but does not provide structured pseudocode blocks or algorithm listings.
Open Source Code | Yes | The source code and models are available at https://github.com/fzh0917/SparseTT.
Open Datasets | Yes | We use the train splits of TrackingNet [Muller et al., 2018], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], ILSVRC VID [Russakovsky et al., 2015], ILSVRC DET [Russakovsky et al., 2015] and COCO [Lin et al., 2014] as the training dataset, in addition to the GOT-10k [Huang et al., 2019] benchmark.
Dataset Splits | Yes | We use the train splits of TrackingNet [Muller et al., 2018], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], ILSVRC VID [Russakovsky et al., 2015], ILSVRC DET [Russakovsky et al., 2015] and COCO [Lin et al., 2014] as the training dataset... The whole training process takes about 60 hours on 4 NVIDIA RTX 2080 Ti GPUs. Note that the training time of TransT is about 10 days (240 hours), which is 4× that of our method.
Hardware Specification | Yes | The whole training process takes about 60 hours on 4 NVIDIA RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch/TensorFlow version, CUDA version).
Experiment Setup | Yes | In the MSA, SMSA, and MCA, the number of heads is set to 8, the number of channels in the hidden layers of FFN is set to 2048, and the dropout rate is set to 0.1. The number of encoder layers N and the number of decoder layers M are set to 2, and the sparseness K in SMSA is set to 32... We use the AdamW optimizer to train our method for 20 epochs... The batch size is set to 32, and the learning rate and the weight decay are both set to 1×10⁻⁴. After training for 10 epochs and 15 epochs, the learning rate decreases to 1×10⁻⁵ and 1×10⁻⁶, respectively.
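
Since the paper provides no pseudocode, the core idea behind the sparse multi-head self-attention (SMSA) with sparseness K = 32 can be illustrated roughly: each query attends only to its top-K highest-scoring keys and the remaining attention weights are zeroed. This is a minimal single-row, single-head sketch in pure Python; the function name and the exact masking scheme are assumptions for illustration, not the authors' implementation.

```python
import math

def sparse_topk_attention_row(scores, k):
    """Sketch of top-K sparse attention for one query.

    scores: raw attention scores of this query against all keys.
    k: sparseness parameter (the paper uses K = 32).
    Returns attention weights where only the top-k scored keys
    receive non-zero (softmax-normalized) weight.
    """
    # Indices of the k largest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected keys; all others get weight 0.
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())
    return [exps.get(i, 0.0) / z for i in range(len(scores))]
```

In the real module this selection would be applied per head over score matrices (e.g. with a batched top-k), but the row-wise form above captures why the mechanism focuses on the most relevant positions.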
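
The quoted schedule (1×10⁻⁴ initially, dropping to 1×10⁻⁵ after 10 epochs and 1×10⁻⁶ after 15 epochs, for 20 epochs total) amounts to a simple step decay. A sketch under the assumption that epochs are 1-indexed and the drop takes effect on the following epoch; the boundary handling and function name are assumptions, not taken from the paper.

```python
def learning_rate(epoch):
    """Step learning-rate schedule matching the quoted setup:
    epochs 1-10 -> 1e-4, epochs 11-15 -> 1e-5, epochs 16-20 -> 1e-6."""
    if epoch <= 10:
        return 1e-4
    if epoch <= 15:
        return 1e-5
    return 1e-6
```

In practice this would typically be handed to the optimizer via a scheduler (e.g. a LambdaLR-style wrapper around AdamW), with weight decay fixed at 1×10⁻⁴ as the paper states.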