MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts

Authors: Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, Kaiqi Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we conduct extensive experiments on mainstream VLT datasets (e.g., MGIT, TNL2K, LaSOT and LaSOT_ext). Experimental results show that MemVLT achieves new state-of-the-art performance.
Researcher Affiliation | Academia | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) Institute of Automation, Chinese Academy of Sciences; (3) School of Computer Science and Technology, University of Chinese Academy of Sciences; (4) Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences; (5) School of Physical and Mathematical Sciences, Nanyang Technological University
Pseudocode | Yes | Algorithm 1: Long-term Memory Storage Algorithm (an illustrative sketch follows this table)
Open Source Code | No | The code and models will be released at: https://github.com/XiaokunFeng/MemVLT
Open Datasets | Yes | We use the training splits of LaSOT [27], TNL2K [26], RefCOCOg [49], and OTB99-Lang [1] to train our model.
Dataset Splits | No | The paper mentions 'training splits' and 'test' data but does not explicitly describe a validation split (e.g., percentages or counts for a validation set).
Hardware Specification | Yes | The model is trained on a server with four A5000 GPUs and tested on an RTX-3090 GPU.
Software Dependencies | No | The paper mentions specific models like RoBERTa-Base and HiViT-Base, and the AdamW optimizer, but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | We use RoBERTa-Base [44] as our text encoder and HiViT-Base [42, 39, 40] as our vision encoder, with the token dimension D set to 512. The sizes of template patches and search images are 192×192 and 384×384, respectively. For the acquisition of short-term memory, both the visual and textual branches consist of three SMG layers. ... We employ the AdamW to optimize the network parameters and conduct a total of 200 training epochs. 20,000 image pairs are randomly sampled in each epoch. (A hedged configuration sketch follows this table.)
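The paper's Algorithm 1 (long-term memory storage) is only named here, not reproduced, so the following is a purely illustrative sketch and not the authors' algorithm: a generic fixed-capacity long-term memory that admits a new entry only when the tracker's confidence clears a threshold and evicts the oldest entry when full. The class name, capacity, and threshold are all assumptions for illustration.

# Illustrative only: NOT the paper's Algorithm 1. A generic fixed-capacity
# long-term memory that stores high-confidence entries and evicts the oldest.
from collections import deque

class LongTermMemory:
    def __init__(self, capacity=8, threshold=0.7):  # capacity/threshold assumed
        self.buffer = deque(maxlen=capacity)  # oldest entry drops automatically
        self.threshold = threshold

    def maybe_store(self, feature, confidence):
        """Store the feature only if the tracker is confident about it."""
        if confidence >= self.threshold:
            self.buffer.append(feature)

    def read(self):
        """Return the currently stored long-term memory entries."""
        return list(self.buffer)

memory = LongTermMemory()
memory.maybe_store("template_t0", confidence=0.92)  # stored
memory.maybe_store("template_t1", confidence=0.41)  # skipped: low confidence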
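To make the Experiment Setup row concrete, here is a minimal PyTorch sketch that wires the reported hyperparameters (token dimension 512, 192×192 templates, 384×384 search images, three SMG layers per branch, AdamW, 200 epochs, 20,000 pairs per epoch) into a training loop. The model stub, learning rate, and batch size are assumptions; the released code at https://github.com/XiaokunFeng/MemVLT is authoritative.

# Hypothetical training-setup sketch based on the values quoted above.
# MemVLTStub is a placeholder, not the RoBERTa-Base/HiViT-Base architecture.
import torch
import torch.nn as nn

TOKEN_DIM = 512           # token dimension D reported in the paper
TEMPLATE_SIZE = 192       # template patches: 192x192
SEARCH_SIZE = 384         # search images: 384x384
NUM_SMG_LAYERS = 3        # SMG layers per (visual, textual) branch
EPOCHS = 200              # total training epochs
PAIRS_PER_EPOCH = 20_000  # image pairs sampled per epoch
BATCH_SIZE = 32           # assumed; not stated in the quoted setup

class MemVLTStub(nn.Module):
    """Placeholder standing in for the encoders and SMG layers."""
    def __init__(self, dim=TOKEN_DIM):
        super().__init__()
        self.head = nn.Linear(dim, 4)  # e.g., bounding-box regression head

    def forward(self, fused_tokens):
        return self.head(fused_tokens.mean(dim=1))

model = MemVLTStub()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is assumed

for epoch in range(EPOCHS):
    for _ in range(PAIRS_PER_EPOCH // BATCH_SIZE):
        fused = torch.randn(BATCH_SIZE, 16, TOKEN_DIM)  # stand-in for encoded tokens
        loss = model(fused).pow(2).mean()               # stand-in loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    break  # remove to run all 200 epochs; truncated here for illustration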