SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Authors: Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, Haibin Ling

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on five large-scale benchmarks to verify the effectiveness of SwinTrack, including LaSOT [9], LaSOT_ext [8], TrackingNet [26], GOT-10k [15] and TNL2k [34]. Our codes and results are released at https://github.com/LitingLin/SwinTrack.
Researcher Affiliation | Collaboration | Liting Lin (1,2), Heng Fan (3), Zhipeng Zhang (4), Yong Xu (1,2), Haibin Ling (5); (1) School of Computer Science & Engineering, South China Univ. of Tech., Guangzhou, China; (2) Peng Cheng Laboratory, Shenzhen, China; (3) Department of Computer Science and Engineering, University of North Texas, Denton, USA; (4) DiDi Chuxing, Beijing, China; (5) Department of Computer Science, Stony Brook University, Stony Brook, USA
Pseudocode | No | The paper presents architectural diagrams and mathematical formulations, but it does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our codes and results are released at https://github.com/LitingLin/SwinTrack.
Open Datasets | Yes | We train SwinTrack using the training splits of LaSOT [9], TrackingNet [26], GOT-10k [15] (1,000 videos are removed following [36] for fair comparison) and COCO 2017 [22].
Dataset Splits | No | The paper mentions using "training splits" of various datasets (LaSOT, TrackingNet, GOT-10k, COCO 2017) and discusses training details such as preventing overfitting ("For the models trained for the GOT-10k evaluation protocol, to prevent over-fitting, the training epoch is set to 150..."), but it does not explicitly specify a distinct validation split, its size, or how it was used for hyperparameter tuning or early stopping.
Hardware Specification | Yes | We train the network on 8 NVIDIA V100 GPUs for 300 epochs with 131,072 samples per epoch.
Software Dependencies | No | The paper mentions the optimizer "AdamW [24]" but does not provide specific version numbers for any software libraries (e.g., PyTorch, TensorFlow, or Python) required to replicate the experiments.
Experiment Setup | Yes | Model. We design two variants of SwinTrack with different configurations as follows: SwinTrack-T-224. Backbone: Swin Transformer-Tiny [23], pretrained with ImageNet-1k; Template size: 112×112; Search region size: 224×224; C = 384; N = 4. SwinTrack-B-384. Backbone: Swin Transformer-Base [23], pretrained with ImageNet-22k; Template size: 192×192; Search region size: 384×384; C = 512; N = 8. Training. The model is optimized with AdamW [24], with a learning rate of 5e-4 and a weight decay of 1e-4. The learning rate of the backbone is set to 5e-5. We train the network on 8 NVIDIA V100 GPUs for 300 epochs with 131,072 samples per epoch. The learning rate is dropped by a factor of 10 after 210 epochs. A 3-epoch linear warmup is applied to stabilize the training process. DropPath [18] is applied to the backbone and the encoder with a rate of 0.1.
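
The training recipe quoted above (AdamW at 5e-4 with weight decay 1e-4, a 10x smaller backbone learning rate of 5e-5, a 3-epoch linear warmup, and a single 10x drop after epoch 210 of 300) can be expressed compactly in PyTorch. The sketch below is not the authors' released code: the function name `build_optimizer_and_scheduler`, the assumption that backbone parameters are reachable under a `backbone` name prefix, and per-epoch scheduler stepping are illustrative choices.

```python
# Minimal sketch of the reported optimization schedule (not the authors' code).
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model,
                                  base_lr=5e-4,      # lr for encoder/decoder/head
                                  backbone_lr=5e-5,  # 10x smaller lr for pretrained backbone
                                  weight_decay=1e-4,
                                  warmup_epochs=3,   # 3-epoch linear warmup
                                  drop_epoch=210,    # lr dropped by 10x after epoch 210
                                  drop_factor=0.1):
    # Assumption: backbone parameter names start with "backbone".
    backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
    other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]

    optimizer = AdamW(
        [{"params": other_params, "lr": base_lr},
         {"params": backbone_params, "lr": backbone_lr}],
        weight_decay=weight_decay,
    )

    def lr_lambda(epoch):
        # Linear warmup over the first epochs, constant afterwards,
        # then one 10x drop for the remaining epochs.
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        if epoch < drop_epoch:
            return 1.0
        return drop_factor

    # The same lambda is applied to both parameter groups, preserving their lr ratio.
    scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)
    return optimizer, scheduler
```

In such a setup the scheduler would be stepped once per epoch after the optimizer updates; the 131,072 samples per epoch and the DropPath rate of 0.1 belong to the data sampler and the model definition and are not shown here.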