Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Towards General Continuous Memory for Vision-Language Models

Authors: Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.
Researcher Affiliation | Academia | Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang University of California, San Diego. EMAIL
Pseudocode | No | The paper describes methods using textual descriptions and mathematical formulas (e.g., Equation 1) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data is publicly released here https://github.com/WenyiWU0111/CoMEM.
Open Datasets | Yes | We use WIT [40] (Wikipedia-based Image Text Dataset) as our retrieval knowledge base. Building upon this, we conduct experiments across eight multimodal and multilingual reasoning benchmarks, including six multimodal reasoning benchmarks: InfoSeek [35], OVEN [42], MRAG-Bench [43], OK-VQA [36], A-OKVQA [37], and ViQuAE [44], and two multilingual benchmarks: CVQA [45] and multilingual InfoSeek. In Appendix D.1 Evaluation on Image Captioning, it states: "on a caption generation task using the COCO 2014 dataset [63]."
Dataset Splits | Yes | Specifically, we begin by selecting questions from the training sets of InfoSeek [35], Encyclopedic-VQA (EVQA) [41], and OK-VQA [36] to ensure coverage of diverse multimodal reasoning tasks. For InfoSeek, the ground truth answers for test sets are not publicly available, so we follow prior work [38, 26, 27] and report results on the validation sets. These sets include questions not seen during training and those associated with unseen entities. Overall, our final fine-tuning corpus for continuous memory includes 15.6K curated samples.
Hardware Specification | Yes | Our entire training process can be completed on a single NVIDIA H100 GPU in 20 hours. All experiments were conducted on a single NVIDIA H100 and the results are summarized in Table 10.
Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as Python, PyTorch, or other libraries. It mentions using CLIP-based retriever [39] which implies a software component, but without a version number.
Experiment Setup | Yes | Concretely, we only need to fine-tune the low-rank adaptation matrices (LoRA) [24] in the VLM-based memory encoder, and a lightweight Q-Former [25] for further compressing the VLM representations into only eight embeddings, 1.2% parameters in total. For efficiency, we apply LoRA with a rank of 16 and share parameters across all layers of the Q-Former. The whole process is formulated as: $H^{(0)} = q,\quad H^{(\ell)} = \mathrm{TransformerLayer}^{(\ell)}\big(H^{(\ell-1)}, E_t\big),\quad V_t = H^{(L)}$ (1). We also empirically find the training converges fast, and a single epoch is sufficient to achieve strong performance.
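
The recurrence quoted in the Experiment Setup row can be illustrated with a small sketch. This is not the authors' implementation: it assumes each layer takes the previous hidden states together with the VLM representations E_t, replaces the full Q-Former layer with a single-head cross-attention plus a residual connection, and uses toy sizes. Every function name, the hidden size, the sequence length, and the layer count are hypothetical. Only the eight query embeddings and the sharing of one set of weights across all layers come from the quoted text.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_layer(h, e, w_q, w_k, w_v):
    """Simplified layer (assumption): h attends over e, followed by a residual add."""
    q = matmul(h, w_q)
    k = matmul(e, w_k)
    v = matmul(e, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    attended = matmul(weights, v)
    return [[x + y for x, y in zip(h_row, a_row)] for h_row, a_row in zip(h, attended)]


def compress(queries, encoder_states, shared_weights, num_layers):
    """Apply H(0) = q, H(l) = Layer(H(l-1), E_t), V_t = H(L) with shared weights."""
    h = queries  # H(0) = q
    for _ in range(num_layers):
        # The same weight matrices are reused at every layer (parameter sharing).
        h = cross_attention_layer(h, encoder_states, *shared_weights)
    return h  # V_t = H(L)


def random_matrix(rows, cols, rng):
    """Small random matrix used as a stand-in for learned parameters or inputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4          # toy hidden size (assumption)
num_queries = 8  # eight memory embeddings, as stated in the quoted text
seq_len = 20     # toy number of VLM token representations (assumption)
num_layers = 3   # toy depth (assumption)

queries = random_matrix(num_queries, dim, rng)
encoder_states = random_matrix(seq_len, dim, rng)
shared_weights = tuple(random_matrix(dim, dim, rng) for _ in range(3))

memory = compress(queries, encoder_states, shared_weights, num_layers)
print(len(memory), len(memory[0]))  # prints "8 4"
```

Whatever the length of the input sequence, the output keeps the shape of the query matrix, which is how a fixed set of eight queries can compress an arbitrarily long sequence of representations into eight embeddings. Reusing one set of weights in the loop also means the trainable parameter count does not grow with depth.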