Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features

Authors: Xiaokun Feng, Dailing Zhang, Shiyu Hu, Xuchen Li, Meiqi Wu, Jing Zhang, Xiaotang Chen, Kaiqi Huang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks.
Researcher Affiliation Academia 1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 2The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China 3School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 4School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China.
Pseudocode No The paper describes methods and processes in text and figures (e.g., Figure 1 for the framework), but does not contain a dedicated section or block explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code Yes The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.
Open Datasets Yes First, our training dataset includes common RGB-X datasets such as DepthTrack (Yan et al., 2021b), LasHeR (Li et al., 2021) and VisEvent (Wang et al., 2023). Furthermore, considering that the scale of these datasets is insufficient to support joint training, we also incorporate widely used RGB tracking datasets, namely LaSOT (Fan et al., 2019), GOT-10K (Huang et al., 2019), COCO (Lin et al., 2014), TrackingNet (Muller et al., 2018), VastTrack (Peng et al., 2024), and TNL2K (Wang et al., 2021b), into our training set.
Dataset Splits Yes DepthTrack (Yan et al., 2021b) is a comprehensive and long-term RGB-D tracking benchmark. It consists of 150 training sequences and 50 testing sequences, with 15 per-frame attributes. [...] LasHeR (Li et al., 2021) is a large-scale, high-diversity benchmark for short-term RGB-T tracking, comprising 979 video pairs for training and 245 pairs for testing. [...] VisEvent (Wang et al., 2023) is the largest dataset for RGB-E tracking, consisting of 500 pairs of training videos and 320 pairs of testing videos.
Hardware Specification Yes Our tracker is trained on a server equipped with four A5000 GPUs and tested on an RTX-3090 GPU.
Software Dependencies No The paper mentions software components like 'HiViT-base', 'Fast-iTPN pre-trained weights', and 'AdamW optimizer', but does not provide specific version numbers for these or other ancillary software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes For the implementation of CSTrack, we adopt HiViT-base (Zhang et al., 2022b) as the modality-shared patch embedding and backbone, which is initialized with Fast-iTPN (Tian et al., 2024) pre-trained weights; the token dimension D is set to 512. In the SCM, the length of modality-specific queries Nq is set to 4. In the TCM, each temporal feature is represented by 16 tokens (i.e., Nm = 16), with the temporal length L set to 4 by default. The sizes of template patches and search images are 128×128 and 256×256, respectively. Our training process consists of two stages. In the first stage, we train the model for 150 epochs without incorporating TCM. Each epoch contains 10,000 samples, and each sample consists of a single search image. In the second stage, we introduce TCM based on the model trained in the first stage, which only involves spatial compactness modeling. During this stage, the backbone of the model is frozen, and we train the remaining parameters for 50 epochs. Each epoch includes 3,000 samples, where each sample comprises six search images. We employ the AdamW optimizer to optimize the network parameters. [...] L_all = L_cls + λ_iou·L_iou + λ_L1·L_L1, (15) where λ_iou = 2 and λ_L1 = 5 are the specific regularization parameters. For RGB-T benchmarks, namely LasHeR and RGBT234, we set the update threshold to 0.45; for other datasets, the threshold is set to 0.7.
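The quoted setup combines three loss terms per Eq. (15) and applies a benchmark-dependent update threshold. The sketch below illustrates both rules; it is a minimal illustration assuming scalar loss values and a per-benchmark confidence score, with function names chosen for illustration (they are not from the paper's code).

```python
def total_loss(l_cls, l_iou, l_l1, lambda_iou=2.0, lambda_l1=5.0):
    """Combine the three losses as in Eq. (15):
    L_all = L_cls + lambda_iou * L_iou + lambda_L1 * L_L1,
    with the paper's defaults lambda_iou = 2 and lambda_L1 = 5."""
    return l_cls + lambda_iou * l_iou + lambda_l1 * l_l1


def should_update_template(confidence, benchmark):
    """Illustrative temporal update rule: RGB-T benchmarks (LasHeR,
    RGBT234) use a 0.45 threshold; all other datasets use 0.7."""
    threshold = 0.45 if benchmark in {"LasHeR", "RGBT234"} else 0.7
    return confidence > threshold


# Example: classification loss 1.0, IoU loss 0.5, L1 loss 0.2
print(total_loss(1.0, 0.5, 0.2))            # → 3.0
print(should_update_template(0.5, "LasHeR"))  # → True
```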