MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts

Authors: Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, Kaiqi Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we conduct extensive experiments on mainstream VLT datasets (e.g., MGIT, TNL2K, LaSOT and LaSOT_ext). Experimental results show that MemVLT achieves new state-of-the-art performance.
Researcher Affiliation | Academia | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) Institute of Automation, Chinese Academy of Sciences; (3) School of Computer Science and Technology, University of Chinese Academy of Sciences; (4) Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences; (5) School of Physical and Mathematical Sciences, Nanyang Technological University
Pseudocode | Yes | Algorithm 1: Long-term Memory Storage Algorithm (an illustrative sketch follows this table)
Open Source Code | No | The code and models will be released at: https://github.com/XiaokunFeng/MemVLT
Open Datasets | Yes | We use the training splits of LaSOT [27], TNL2K [26], RefCOCOg [49], and OTB99-Lang [1] to train our model.
Dataset Splits | No | The paper mentions 'training splits' and 'test' data but does not explicitly describe a validation split (e.g., percentages or counts for a validation set).
Hardware Specification | Yes | The model is trained on a server with four A5000 GPUs and tested on an RTX-3090 GPU.
Software Dependencies | No | The paper mentions specific models like RoBERTa-Base and HiViT-Base, and the AdamW optimizer, but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | We use RoBERTa-Base [44] as our text encoder and HiViT-Base [42, 39, 40] as our vision encoder, with the token dimension D set to 512. The sizes of template patches and search images are 192×192 and 384×384, respectively. For the acquisition of short-term memory, both the visual and textual branches consist of three SMG layers. ... We employ the AdamW to optimize the network parameters and conduct a total of 200 training epochs. 20,000 image pairs are randomly sampled in each epoch. (A hedged configuration sketch follows this table.)
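The paper's Algorithm 1 (long-term memory storage) is only named here, not reproduced, so the following is a purely illustrative sketch and not the authors' algorithm: a generic fixed-capacity long-term memory that admits a new entry only when the tracker's confidence clears a threshold and evicts the oldest entry when full. The class name, capacity, and threshold are all assumptions for illustration.

# Illustrative only: NOT the paper's Algorithm 1. A generic fixed-capacity
# long-term memory that stores high-confidence entries and evicts the oldest.
from collections import deque

class LongTermMemory:
    def __init__(self, capacity=8, threshold=0.7):  # capacity/threshold assumed
        self.buffer = deque(maxlen=capacity)  # oldest entry drops automatically
        self.threshold = threshold

    def maybe_store(self, feature, confidence):
        """Store the feature only if the tracker is confident about it."""
        if confidence >= self.threshold:
            self.buffer.append(feature)

    def read(self):
        """Return the currently stored long-term memory entries."""
        return list(self.buffer)

memory = LongTermMemory()
memory.maybe_store("template_t0", confidence=0.92)  # stored
memory.maybe_store("template_t1", confidence=0.41)  # skipped: low confidence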
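To make the Experiment Setup row concrete, here is a minimal PyTorch sketch that wires the reported hyperparameters (token dimension 512, 192×192 templates, 384×384 search images, three SMG layers per branch, AdamW, 200 epochs, 20,000 pairs per epoch) into a training loop. The model stub, learning rate, and batch size are assumptions; the released code at https://github.com/XiaokunFeng/MemVLT is authoritative.

# Hypothetical training-setup sketch based on the values quoted above.
# MemVLTStub is a placeholder, not the RoBERTa-Base/HiViT-Base architecture.
import torch
import torch.nn as nn

TOKEN_DIM = 512           # token dimension D reported in the paper
TEMPLATE_SIZE = 192       # template patches: 192x192
SEARCH_SIZE = 384         # search images: 384x384
NUM_SMG_LAYERS = 3        # SMG layers per (visual, textual) branch
EPOCHS = 200              # total training epochs
PAIRS_PER_EPOCH = 20_000  # image pairs sampled per epoch
BATCH_SIZE = 32           # assumed; not stated in the quoted setup

class MemVLTStub(nn.Module):
    """Placeholder standing in for the encoders and SMG layers."""
    def __init__(self, dim=TOKEN_DIM):
        super().__init__()
        self.head = nn.Linear(dim, 4)  # e.g., bounding-box regression head

    def forward(self, fused_tokens):
        return self.head(fused_tokens.mean(dim=1))

model = MemVLTStub()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is assumed

for epoch in range(EPOCHS):
    for _ in range(PAIRS_PER_EPOCH // BATCH_SIZE):
        fused = torch.randn(BATCH_SIZE, 16, TOKEN_DIM)  # stand-in for encoded tokens
        loss = model(fused).pow(2).mean()               # stand-in loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    break  # remove to run all 200 epochs; truncated here for illustration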