Is Multiple Object Tracking a Matter of Specialization?
Authors: Gianluca Mancusi, Mattia Bernardi, Aniello Panariello, Angelo Porrello, Rita Cucchiara, Simone Calderara
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on MOTSynth, along with zero-shot evaluations on MOT17 and PersonPath22, demonstrate that a neural tracker built from carefully selected modules surpasses its monolithic counterpart. We release models and code. |
| Researcher Affiliation | Academia | Gianluca Mancusi, Mattia Bernardi, Aniello Panariello, Angelo Porrello, Rita Cucchiara, Simone Calderara; AImage Lab, University of Modena and Reggio Emilia; name.surname@unimore.it |
| Pseudocode | No | The paper describes the methods in detail but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release models and code. |
| Open Datasets | Yes | MOTSynth [10] is a large synthetic dataset for pedestrian detection and tracking in urban scenarios, generated using a photorealistic video game. It comprises 764 full HD videos, each 1800 frames long, showcasing various attributes. |
| Dataset Splits | Yes | In our experiments, following [29], we reduced the test sequences to 600 frames each and further split the training set to extract 48 validation sequences, shortened to 150 frames, for validation during training. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | The training is performed on a single RTX 4080 GPU with a batch size of 1 for both phases. |
| Software Dependencies | No | The paper mentions software like YOLOX, DanceTrack, and ResNet but does not provide specific version numbers for these or other libraries/frameworks. |
| Experiment Setup | Yes | The learning rates are set to 5×10⁻⁵ for the transformer and 1×10⁻⁶ for the visual backbone. For the LoRA hyperparameters, we use r = 16, a weight decay of 0.1, and a learning rate of 3×10⁻⁴. The scale & shift layers employ a learning rate of 1×10⁻⁵ and a weight decay of 1×10⁻⁴. The training is performed on a single RTX 4080 GPU with a batch size of 1 for both phases. Due to the small batch size, we accumulate gradients over four backward steps before performing an optimizer step. (A hedged configuration sketch follows the table.) |
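
The split protocol quoted in the Dataset Splits row (test sequences cut to 600 frames; 48 training sequences carved out as a validation set and shortened to 150 frames) can be illustrated with a minimal sketch. The representation of sequences as frame-index lists, the random selection of validation sequences, and the seed are assumptions; only the three numbers come from the quoted text.

```python
import random

# Minimal sketch of the split protocol described in the paper. Assumptions:
# each split is a dict mapping a sequence id to an ordered list of frame
# indices; which training sequences become validation sequences is chosen
# at random here, with a hypothetical seed.
TEST_FRAMES = 600        # test sequences shortened to 600 frames
VAL_FRAMES = 150         # validation sequences shortened to 150 frames
NUM_VAL_SEQUENCES = 48   # validation sequences carved out of the training set


def build_splits(train_sequences: dict, test_sequences: dict, seed: int = 0):
    """Truncate test sequences and carve a validation split from training."""
    rng = random.Random(seed)

    # Shorten every test sequence to its first 600 frames.
    test = {sid: frames[:TEST_FRAMES] for sid, frames in test_sequences.items()}

    # Pick 48 training sequences for validation, shortened to 150 frames.
    val_ids = set(rng.sample(sorted(train_sequences), NUM_VAL_SEQUENCES))
    val = {sid: train_sequences[sid][:VAL_FRAMES] for sid in val_ids}

    # The remaining sequences stay in the training split at full length.
    train = {sid: f for sid, f in train_sequences.items() if sid not in val_ids}
    return train, val, test
```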
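
The Experiment Setup row can likewise be summarized as a training-loop sketch. The optimizer choice (AdamW), the module names (`transformer`, `backbone`, `lora`, `scale_shift`), and the assumption that the model returns its loss directly are hypothetical; only the learning rates, weight decays, batch size, and the four-step gradient accumulation are taken from the quoted text.

```python
import torch

# Hedged sketch of the reported optimization setup. The LoRA rank (r = 16)
# is configured when the adapters are created and is not shown here.
ACCUM_STEPS = 4  # accumulate gradients over four backward passes (batch size 1)


def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # One parameter group per component, with the learning rates and weight
    # decays quoted in the table; AdamW is an assumption.
    param_groups = [
        {"params": model.transformer.parameters(), "lr": 5e-5},
        {"params": model.backbone.parameters(), "lr": 1e-6},
        {"params": model.lora.parameters(), "lr": 3e-4, "weight_decay": 0.1},
        {"params": model.scale_shift.parameters(), "lr": 1e-5, "weight_decay": 1e-4},
    ]
    return torch.optim.AdamW(param_groups)


def train_one_epoch(model, loader, optimizer):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):      # loader yields batches of size 1
        loss = model(batch)                    # assumes the model returns its training loss
        (loss / ACCUM_STEPS).backward()        # scale so accumulated gradients match a larger batch
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```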