Revisiting motion information for RGB-Event tracking with MOT philosophy
Authors: Tianlu Zhang, Kurt Debattista, Qiang Zhang, Guiguang Ding, Jungong Han
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed method is evaluated on multiple benchmark datasets and achieves state-of-the-art performance on all the datasets tested. |
| Researcher Affiliation | Academia | Tianlu Zhang, EMIM, Xidian University, tlzhang96@outlook.com; Kurt Debattista, Warwick Manufacturing Group, University of Warwick, K.Debattista@warwick.ac.uk; Qiang Zhang*, EMIM, Xidian University, qzhang@xidian.edu.cn; Guiguang Ding, School of Software, Tsinghua University, dinggg@tsinghua.edu.cn; Jungong Han*, Department of Automation, Tsinghua University, jungonghan77@gmail.com |
| Pseudocode | No | The paper describes its methodology and components using text and diagrams but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the work described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We evaluate the performance of our proposed CSAM on three large-scale RGB-E single object tracking datasets: VisEvent [29], FE108 [38], and COESOT [24]. |
| Dataset Splits | No | The paper provides training and testing splits for the datasets (e.g., 'divided into 827 and 527 sequences for training and testing, respectively' for COESOT), but it does not specify explicit validation dataset splits. |
| Hardware Specification | Yes | The CSAM training is conducted on two Nvidia RTX 3090 GPUs. For inference, we test our tracker on a single Nvidia RTX 3090 GPU. |
| Software Dependencies | Yes | Our proposed CSAM is implemented in Python 3.8 using PyTorch 1.7.1. |
| Experiment Setup | Yes | The search region is 4² times the target object area and resized to a resolution of 256×256 pixels, whilst the template is 2² times the target object area and resized to 128×128 pixels. ... We initialize ViT-Tiny using the weights from DeiT-Tiny [25], and the backbone ViT-B weights are initialized with the corresponding MAE encoders [12]. ... The overall loss function can be written as: L = λ1·L_focal + λ2·L_L1 + λ3·L_giou, where the hyper-parameters λ1, λ2, and λ3 are set as 1, 1, and 14, respectively. ... optimized by the AdamW optimizer with a weight decay of 1×10⁻⁴ for 50 epochs. The initial learning rates for the backbone and other parameters were set to 4×10⁻⁵ and 4×10⁻⁴, respectively. ... Our candidate matching network is optimized by the Adam optimizer with a weight decay of 0.2 for 15 epochs. The initial learning rate is set to 4×10⁻⁵. (A hedged sketch of this setup follows the table.) |
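
The loss weights and optimizer settings quoted in the setup row are concrete enough to sketch in code. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the `focal_loss` callable, the (x1, y1, x2, y2) box format, and the `"backbone"` parameter-naming convention are all assumptions, and `torchvision.ops.generalized_box_iou_loss` requires a newer torchvision than the PyTorch 1.7.1 stack the paper reports.

```python
from torch import nn, optim
from torchvision.ops import generalized_box_iou_loss


def tracking_loss(cls_logits, cls_targets, pred_boxes, gt_boxes, focal_loss,
                  lambda_focal=1.0, lambda_l1=1.0, lambda_giou=14.0):
    """Combined loss L = λ1·L_focal + λ2·L_L1 + λ3·L_giou with the reported
    weights (1, 1, 14). `focal_loss` is a user-supplied classification loss;
    boxes are assumed to be (x1, y1, x2, y2) tensors of shape (N, 4)."""
    l_focal = focal_loss(cls_logits, cls_targets)
    l_l1 = nn.functional.l1_loss(pred_boxes, gt_boxes)
    l_giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return lambda_focal * l_focal + lambda_l1 * l_l1 + lambda_giou * l_giou


def build_main_optimizer(model):
    """AdamW with weight decay 1e-4; backbone lr 4e-5, other parameters 4e-4,
    matching the reported 50-epoch main training stage."""
    backbone = [p for n, p in model.named_parameters() if "backbone" in n]
    others = [p for n, p in model.named_parameters() if "backbone" not in n]
    return optim.AdamW(
        [{"params": backbone, "lr": 4e-5}, {"params": others, "lr": 4e-4}],
        weight_decay=1e-4,
    )


def build_matching_optimizer(matching_net):
    """Adam with weight decay 0.2 and lr 4e-5, matching the reported
    15-epoch candidate matching network stage."""
    return optim.Adam(matching_net.parameters(), lr=4e-5, weight_decay=0.2)
```

The two optimizer builders mirror the two training stages the response quotes; per-parameter-group learning rates are the standard PyTorch mechanism for giving the backbone a smaller rate than the rest of the model.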