MixFormerV2: Efficient Fully Transformer Tracking
Authors: Yutao Cui, Tianhui Song, Gangshan Wu, Limin Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 (Experiments): We evaluate the performance of our proposed trackers on 6 benchmark datasets: including the large-scale LaSOT [20], LaSOT_ext [20], TrackingNet [42], UAV123 [41], TNL2K [48] and VOT2022 [30]. |
| Researcher Affiliation | Academia | Yutao Cui, Tianhui Song, Gangshan Wu, Limin Wang; State Key Laboratory for Novel Software Technology, Nanjing University, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/MCG-NJU/MixFormerV2 |
| Open Datasets | Yes | The training datasets include TrackingNet [42], LaSOT [20], GOT-10k [28] and COCO [35] training splits, which are the same as MixFormer [14]. |
| Dataset Splits | No | The paper lists training datasets (TrackingNet, LaSOT, GOT-10k, COCO) and test datasets, but does not explicitly specify a separate validation split. |
| Hardware Specification | Yes | The distillation training is conducted on 8 NVIDIA Quadro RTX 8000 GPUs. The inference process runs on one NVIDIA Quadro RTX 8000 GPU and an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz. |
| Software Dependencies | Yes | Our trackers are implemented using Python 3.6 and PyTorch 1.7. |
| Experiment Setup | Yes | Each distillation training stage takes 500 epochs, where the first $m = 40$ epochs are for progressively eliminating layers. We train the score prediction MLP for an additional 50 epochs. The batch size is 256, with each GPU holding 32 samples. We use the AdamW optimizer with a weight decay of $10^{-4}$. The initial learning rate is $10^{-4}$ and is decreased to $10^{-5}$ after 400 epochs. We use horizontal flip and brightness jittering for data augmentation. The resolutions of search and template images for MixFormerV2-B are $288 \times 288$ and $128 \times 128$ respectively. For MixFormerV2-S, the resolutions of search and template images are $224 \times 224$ and $112 \times 112$, for real-time tracking on a CPU platform. |
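
The Experiment Setup row can be read as a small training configuration. The sketch below is not taken from the MixFormerV2 repository: the class name, field names, and helper functions are hypothetical, epochs are assumed to be zero-indexed, and the learning-rate change is assumed to be a single step at epoch 400 (the quoted text only says the rate is decreased after 400 epochs). It uses only the Python standard library, so it illustrates the numbers without depending on PyTorch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillStageConfig:
    """Values quoted in the Experiment Setup row; all names are hypothetical."""

    total_epochs: int = 500        # epochs per distillation training stage
    elimination_epochs: int = 40   # m: epochs used to progressively eliminate layers
    lr_drop_epoch: int = 400       # the learning rate is lowered after this many epochs
    base_lr: float = 1e-4          # initial learning rate
    final_lr: float = 1e-5         # learning rate after the drop
    weight_decay: float = 1e-4     # AdamW weight decay
    num_gpus: int = 8              # GPUs used for distillation training
    samples_per_gpu: int = 32      # samples held by each GPU

    @property
    def global_batch_size(self) -> int:
        # 8 GPUs x 32 samples per GPU gives the reported batch size of 256.
        return self.num_gpus * self.samples_per_gpu


def learning_rate(epoch: int, cfg: DistillStageConfig) -> float:
    """Assumed step schedule: base_lr before lr_drop_epoch, final_lr from then on."""
    if not 0 <= epoch < cfg.total_epochs:
        raise ValueError(f"epoch must be in [0, {cfg.total_epochs}), got {epoch}")
    return cfg.base_lr if epoch < cfg.lr_drop_epoch else cfg.final_lr


def in_elimination_phase(epoch: int, cfg: DistillStageConfig) -> bool:
    """True while the (zero-indexed) epoch falls within the first m epochs."""
    return 0 <= epoch < cfg.elimination_epochs
```

With the default values, `DistillStageConfig().global_batch_size` reproduces the reported batch size, and `learning_rate` switches from the initial to the final rate at the assumed boundary. If the actual schedule decays gradually rather than in one step, only `learning_rate` would need to change.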