Beyond Accuracy: Tracking more like Human via Visual Search

Authors: Dailing Zhang, Shiyu Hu, Xiaokun Feng, Xuchen Li, Meiqi Wu, Jing Zhang, Kaiqi Huang

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that the proposed CPDTrack not only achieves state-of-the-art (SOTA) performance in this challenge but also narrows the behavioral differences with humans. |
| Researcher Affiliation | Academia | ¹School of Artificial Intelligence, University of Chinese Academy of Sciences; ²Institute of Automation, Chinese Academy of Sciences; ³School of Computer Science and Technology, University of Chinese Academy of Sciences; ⁴Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences; ⁵School of Physical and Mathematical Sciences, Nanyang Technological University |
| Pseudocode | No | The paper describes the proposed method but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code and models are available at https://github.com/ZhangDailing8/CPDTrack. |
| Open Datasets | Yes | Our training data includes the training splits of VideoCube [], LaSOT [9], GOT-10k [19], and TrackingNet [21]. |
| Dataset Splits | No | The paper mentions using the training splits of several datasets (VideoCube, LaSOT, GOT-10k, TrackingNet), but it does not describe how the data are further divided into training, validation, and test sets for the authors' own experiments, nor does it point to predefined validation splits with specific percentages or counts for hyperparameter tuning. |
| Hardware Specification | Yes | The model is trained on a server with four A5000 GPUs and is tested on an A5000 GPU. |
| Software Dependencies | No | The paper mentions the AdamW optimizer, ViT-B encoder, and MAE pre-trained parameters, but does not specify version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train the model with the AdamW [66] optimizer and set the learning rate of the encoder to 1e-5, the decoder and remaining modules to 1e-4, and the weight decay to 1e-4. The model is trained for a total of 300 epochs with 60k image pairs per epoch. The learning rate decreases by a factor of 10 after 240 epochs. (A hedged sketch of this recipe follows the table.) |
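The Experiment Setup row maps directly onto a standard PyTorch optimizer configuration. Below is a minimal sketch of that recipe, assuming a model that exposes an `encoder` submodule; the function name, the `encoder.` parameter-name prefix, and the commented `sample_pairs` loop are illustrative assumptions, not taken from the authors' released code.

```python
# Hypothetical sketch of the reported training recipe (not the authors' code).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR


def build_optimizer_and_scheduler(model: torch.nn.Module):
    # Encoder is fine-tuned at 1e-5; the decoder and all remaining
    # modules use 1e-4. Weight decay of 1e-4 applies to both groups.
    encoder_params = [p for p in model.encoder.parameters() if p.requires_grad]
    other_params = [
        p for n, p in model.named_parameters()
        if p.requires_grad and not n.startswith("encoder.")  # assumed submodule name
    ]
    optimizer = AdamW(
        [
            {"params": encoder_params, "lr": 1e-5},
            {"params": other_params, "lr": 1e-4},
        ],
        weight_decay=1e-4,
    )
    # 300 epochs total; the learning rate drops by a factor of 10 after epoch 240.
    scheduler = MultiStepLR(optimizer, milestones=[240], gamma=0.1)
    return optimizer, scheduler


# Training-loop skeleton: 300 epochs, each drawing 60k image pairs.
# for epoch in range(300):
#     for template, search in sample_pairs(num_pairs=60_000):  # hypothetical sampler
#         ...  # forward pass, loss, optimizer.step()
#     scheduler.step()
```

Splitting the parameters into two groups is what lets a single AdamW instance apply the paper's distinct encoder and decoder learning rates while sharing one weight-decay setting.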