MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts
Authors: Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, Kaiqi Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive experiments on mainstream VLT datasets (e.g., MGIT, TNL2K, LaSOT and LaSOT_ext). Experimental results show that MemVLT achieves new state-of-the-art performance. |
| Researcher Affiliation | Academia | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) Institute of Automation, Chinese Academy of Sciences; (3) School of Computer Science and Technology, University of Chinese Academy of Sciences; (4) Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences; (5) School of Physical and Mathematical Sciences, Nanyang Technological University |
| Pseudocode | Yes | Algorithm 1 Section-Top Long-term Memory Storage Algorithm |
| Open Source Code | No | The code and models will be released at: https://github.com/XiaokunFeng/MemVLT. |
| Open Datasets | Yes | We use the training splits of LaSOT [27], TNL2K [26], RefCOCOg [49], and OTB99-Lang [1] to train our model. |
| Dataset Splits | No | The paper mentions using 'training splits' and 'test' data, but does not explicitly provide information about validation dataset splits (e.g., percentages or counts for a validation set). |
| Hardware Specification | Yes | The model is trained on a server with four A5000 GPUs and tested on an RTX-3090 GPU. |
| Software Dependencies | No | The paper mentions specific models like RoBERTa-Base and HiViT-Base, and the AdamW optimizer, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | We use RoBERTa-Base [44] as our text encoder and HiViT-Base [42, 39, 40] as our vision encoder, with the token dimension D set to 512. The sizes of template patches and search images are 192×192 and 384×384, respectively. For the acquisition of short-term memory, both the visual and textual branches consist of three SMG layers. ... We employ AdamW to optimize the network parameters and conduct a total of 200 training epochs. 20,000 image pairs are randomly sampled in each epoch. |
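For reference, the reported experiment setup can be collected into a single configuration sketch. This is a hedged reconstruction of only the values quoted above; the dict name and every field name (`MEMVLT_CONFIG`, `token_dim`, `smg_layers_visual`, etc.) are illustrative assumptions, not the authors' actual config schema, and unreported hyperparameters (e.g., learning rate, batch size) are deliberately omitted.

```python
# Hypothetical configuration sketch assembled from the setup reported in the
# paper. Field names are illustrative assumptions; values are as reported.
MEMVLT_CONFIG = {
    # Encoders
    "text_encoder": "RoBERTa-Base",
    "vision_encoder": "HiViT-Base",
    "token_dim": 512,             # token dimension D

    # Input resolutions
    "template_size": (192, 192),  # template patch size
    "search_size": (384, 384),    # search image size

    # Short-term memory generation
    "smg_layers_visual": 3,       # SMG layers in the visual branch
    "smg_layers_textual": 3,      # SMG layers in the textual branch

    # Optimization
    "optimizer": "AdamW",
    "epochs": 200,
    "pairs_per_epoch": 20_000,    # image pairs sampled per epoch

    # Training data (training splits only; no validation split is reported)
    "train_datasets": ["LaSOT", "TNL2K", "RefCOCOg", "OTB99-Lang"],

    # Hardware (trained on four A5000 GPUs, tested on an RTX-3090)
    "train_gpus": 4,
}
```

Collecting these values in one place makes the gaps flagged in the table concrete: the configuration is complete enough to sketch, but without software versions or a validation split, an exact reproduction would still require assumptions.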