SparseTT: Visual Tracking with Sparse Transformers
Authors: Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, Yunhong Wang
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that, without bells and whistles, our method significantly outperforms the state-of-the-art approaches on LaSOT, GOT-10k, TrackingNet, and UAV123, while running at 40 FPS. |
| Researcher Affiliation | Academia | Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai and Yunhong Wang. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; Hangzhou Innovation Institute, Beihang University. {fuzhihong, zehua_fu, qingjie.liu, wenrui_cai, yhwang}@buaa.edu.cn |
| Pseudocode | No | The paper describes mathematical formulas for its components but does not provide structured pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | The source code and models are available at https://github.com/fzh0917/SparseTT. |
| Open Datasets | Yes | We use the train splits of TrackingNet [Muller et al., 2018], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], ILSVRC VID [Russakovsky et al., 2015], ILSVRC DET [Russakovsky et al., 2015] and COCO [Lin et al., 2014] as the training dataset, in addition to the GOT-10k [Huang et al., 2019] benchmark. |
| Dataset Splits | Yes | We use the train splits of TrackingNet [Muller et al., 2018], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], ILSVRC VID [Russakovsky et al., 2015], ILSVRC DET [Russakovsky et al., 2015] and COCO [Lin et al., 2014] as the training dataset... The whole training process takes about 60 hours on 4 NVIDIA RTX 2080 Ti GPUs. Note that the training time of TransT is about 10 days (240 hours), which is 4× that of our method. |
| Hardware Specification | Yes | The whole training process takes about 60 hours on 4 NVIDIA RTX 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch/TensorFlow version, CUDA version). |
| Experiment Setup | Yes | In the MSA, SMSA, and MCA, the number of heads is set to 8, the number of channels in the hidden layers of FFN is set to 2048, and the dropout rate is set to 0.1. The number of encoder layers N and the number of decoder layers M are set to 2, and the sparseness K in SMSA is set to 32... We use the AdamW optimizer to train our method for 20 epochs... The batch size is set to 32, and the learning rate and the weight decay are both set to 1×10⁻⁴. After training for 10 epochs and 15 epochs, the learning rate decreases to 1×10⁻⁵ and 1×10⁻⁶, respectively. |
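
The optimization schedule quoted above (AdamW, learning rate and weight decay both 1×10⁻⁴, decayed tenfold after epochs 10 and 15, over 20 epochs) can be sketched in PyTorch. This is a minimal illustration of the reported hyperparameters only; the `Linear` model is a placeholder, not the SparseTT architecture, and the paper does not state that `MultiStepLR` was the mechanism actually used.

```python
import torch

# Placeholder network; stands in for the actual SparseTT tracker.
model = torch.nn.Linear(8, 2)

# AdamW with lr = 1e-4 and weight decay = 1e-4, as reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Decay the learning rate by 10x after epoch 10 and again after epoch 15,
# matching the reported schedule (1e-4 -> 1e-5 -> 1e-6).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 15], gamma=0.1
)

for epoch in range(20):
    # ... training loop over batches of size 32 would go here ...
    scheduler.step()
```

After the loop, the learning rate has passed through 1×10⁻⁵ (epochs 11–15) and ends at 1×10⁻⁶.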