DiffusionTrack: Diffusion Model for Multi-Object Tracking

Authors: Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, Min Yang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods.
Researcher Affiliation | Academia | 1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 Huazhong University of Science and Technology; 4 University of California, Santa Barbara
Pseudocode | No | The paper does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | Code is available at https://github.com/RainBowLuoCS/DiffusionTrack.
Open Datasets | Yes | We evaluate our method on multiple multi-object tracking datasets including MOT17 (Milan et al. 2016), MOT20 (Dendorfer et al. 2020) and DanceTrack (Sun et al. 2022).
Dataset Splits | Yes | The model is trained on the MOT17 train-half and tested on the MOT17 val-half.
Hardware Specification | Yes | We train our model on 8 NVIDIA GeForce RTX 3090 with FP32-precision and a constant seed for all experiments. The run time is evaluated on a single NVIDIA GeForce 3090 GPU with a mini-batch size of 1 and FP16-precision.
Software Dependencies | Yes | Our approach is implemented in Python 3.8 with PyTorch 1.10.
Experiment Setup | Yes | For MOT17, the training schedule consists of 30 epochs on the combination of MOT17, CrowdHuman, Cityperson and ETHZ for detection and another 30 epochs on MOT17 solely for tracking. We adopt a warm-up learning rate of 2.5e-5 with a 0.2 warm-up factor on the first 5 epochs. The AdamW (Loshchilov and Hutter 2018) optimizer is employed with an initial learning rate of 1e-4, and the learning rate decreases according to the cosine function with the final decrease factor of 0.1. The size of an input image is resized to 1440 × 800. The mini-batch size is set to 16, with each GPU hosting two batches with Ntrain = 500. We set association score threshold τconf = 0.25, 3D NMS threshold τnms3d = 0.6, detection score threshold τdet = 0.7 and 2D NMS threshold τnms2d = 0.7 for default hyper-parameter setting. We also use Mosaic (Bochkovskiy, Wang, and Liao 2020) and Mixup (Zhang et al. 2017) data augmentation during the detection and tracking training phases.
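The warm-up plus cosine-decay learning-rate schedule quoted above can be sketched as follows. Only the numeric hyper-parameters (2.5e-5 warm-up start, 1e-4 base rate, 5 warm-up epochs, final decrease factor 0.1 over 30 epochs) come from the paper; the function name, the epoch-level granularity, and the linear warm-up shape are assumptions (the paper does not specify how its 0.2 warm-up factor enters the ramp).

```python
import math

def lr_at_epoch(epoch, total_epochs=30, warmup_epochs=5,
                base_lr=1e-4, warmup_start_lr=2.5e-5, final_factor=0.1):
    """Hypothetical sketch of the schedule: linear warm-up from
    warmup_start_lr to base_lr, then cosine decay down to
    final_factor * base_lr. Not the authors' exact implementation."""
    if epoch < warmup_epochs:
        # Linear ramp over the first warmup_epochs epochs.
        t = epoch / warmup_epochs
        return warmup_start_lr + t * (base_lr - warmup_start_lr)
    # Cosine decay over the remaining epochs; cos goes 1 -> 0.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cos = 0.5 * (1.0 + math.cos(math.pi * t))
    return final_factor * base_lr + (1.0 - final_factor) * base_lr * cos
```

Under these assumptions the rate starts at 2.5e-5, reaches 1e-4 at epoch 5, and decays smoothly to 1e-5 (0.1 × 1e-4) by epoch 30, matching the quoted initial rate and final decrease factor.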