Revisiting motion information for RGB-Event tracking with MOT philosophy

Authors: Tianlu Zhang, Kurt Debattista, Qiang Zhang, Guiguang Ding, Jungong Han

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed method is evaluated on multiple benchmark datasets and achieves state-of-the-art performance on all the datasets tested.
Researcher Affiliation | Academia | Tianlu Zhang, EMIM, Xidian University, tlzhang96@outlook.com; Kurt Debattista, Warwick Manufacturing Group, University of Warwick, K.Debattista@warwick.ac.uk; Qiang Zhang*, EMIM, Xidian University, qzhang@xidian.edu.cn; Guiguang Ding, School of Software, Tsinghua University, dinggg@tsinghua.edu.cn; Jungong Han*, Department of Automation, Tsinghua University, jungonghan77@gmail.com
Pseudocode | No | The paper describes its methodology and components using text and diagrams but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the work described, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | We evaluate the performance of our proposed CSAM on three large-scale RGB-E single object tracking datasets: VisEvent [29], FE108 [38], and COESOT [24].
Dataset Splits | No | The paper provides training and testing splits for the datasets (e.g., 'divided into 827 and 527 sequences for training and testing, respectively' for COESOT), but it does not specify explicit validation splits.
Hardware Specification | Yes | The CSAM training is conducted on two Nvidia RTX 3090 GPUs. For inference, we test our tracker on a single Nvidia RTX 3090 GPU.
Software Dependencies | Yes | Our proposed CSAM is implemented in Python 3.8 using PyTorch 1.7.1.
Experiment Setup | Yes | The search region is 4² times the target object area and resized to a resolution of 256×256 pixels, whilst the template is 2² times the target object area and resized to 128×128 pixels. ... We initialize ViT-Tiny using the weights from DeiT-Tiny [25], and the backbone weights ViT-B are initialized with corresponding MAE encoders [12]. ... The overall loss function can be written as L = λ1·Lfocal + λ2·LL1 + λ3·Lgiou, where the hyper-parameters λ1, λ2, and λ3 are set as 1, 1, and 14, respectively. ... optimized by the AdamW optimizer with a weight decay of 1×10⁻⁴ for 50 epochs. The initial learning rates for the backbone and other parameters were set to 4×10⁻⁵ and 4×10⁻⁴, respectively. ... Our candidate matching network is optimized by the Adam optimizer with a weight decay of 0.2 for 15 epochs. The initial learning rate is set to 4×10⁻⁵.
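
The hyper-parameters quoted in the Experiment Setup row map onto a fairly standard PyTorch training configuration. The sketch below only illustrates those quoted values; the module names (backbone, head, matching_net) and loss helpers are hypothetical placeholders, not the authors' code, which is not publicly released.

```python
# Minimal sketch of the quoted training setup, assuming generic PyTorch modules.
# All numeric values come from the Experiment Setup quote; everything else
# (module names, function names) is a placeholder for illustration only.
import torch
from torch import nn

# Loss weights quoted from the paper: L = 1*L_focal + 1*L_L1 + 14*L_giou.
LAMBDA_FOCAL, LAMBDA_L1, LAMBDA_GIOU = 1.0, 1.0, 14.0

def tracking_loss(focal_loss: torch.Tensor,
                  l1_loss: torch.Tensor,
                  giou_loss: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the three quoted loss terms."""
    return (LAMBDA_FOCAL * focal_loss
            + LAMBDA_L1 * l1_loss
            + LAMBDA_GIOU * giou_loss)

def build_tracker_optimizer(backbone: nn.Module, head: nn.Module):
    """Main tracker: AdamW, weight decay 1e-4, trained for 50 epochs,
    with a lower learning rate on the backbone (4e-5) than on the
    remaining parameters (4e-4)."""
    param_groups = [
        {"params": backbone.parameters(), "lr": 4e-5},
        {"params": head.parameters(), "lr": 4e-4},
    ]
    return torch.optim.AdamW(param_groups, weight_decay=1e-4)

def build_matching_optimizer(matching_net: nn.Module):
    """Candidate matching network: Adam, weight decay 0.2,
    trained for 15 epochs with learning rate 4e-5."""
    return torch.optim.Adam(matching_net.parameters(), lr=4e-5, weight_decay=0.2)
```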