Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Towards General Continuous Memory for Vision-Language Models

Authors: Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.
Researcher Affiliation	Academia	Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang. University of California, San Diego. EMAIL
Pseudocode	No	The paper describes methods using textual descriptions and mathematical formulas (e.g., Equation 1) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code and data is publicly released here https://github.com/WenyiWU0111/CoMEM.
Open Datasets	Yes	We use WIT [40] (Wikipedia-based Image Text Dataset) as our retrieval knowledge base. Building upon this, we conduct experiments across eight multimodal and multilingual reasoning benchmarks, including six multimodal reasoning benchmarks: InfoSeek [35], OVEN [42], MRAG-Bench [43], OK-VQA [36], A-OKVQA [37], and ViQuAE [44], and two multilingual benchmarks: CVQA [45] and multilingual InfoSeek. In Appendix D.1 Evaluation on Image Captioning, it states: "on a caption generation task using the COCO 2014 dataset [63]."
Dataset Splits	Yes	Specifically, we begin by selecting questions from the training sets of InfoSeek [35], Encyclopedic-VQA (EVQA) [41], and OK-VQA [36] to ensure coverage of diverse multimodal reasoning tasks. For InfoSeek, the ground truth answers for test sets are not publicly available, so we follow prior work [38, 26, 27] and report results on the validation sets. These sets include questions not seen during training and those associated with unseen entities. Overall, our final fine-tuning corpus for continuous memory includes 15.6K curated samples.
Hardware Specification	Yes	Our entire training process can be completed on a single NVIDIA H100 GPU in 20 hours. All experiments were conducted on a single NVIDIA H100 and the results are summarized in Table 10.
Software Dependencies	No	The paper does not state version numbers for software dependencies such as Python, PyTorch, or other libraries. It mentions using a CLIP-based retriever [39], which implies a software component, but gives no version number for it.
Experiment Setup	Yes	Concretely, we only need to fine-tune the low-rank adaptation matrices (LoRA) [24] in the VLM-based memory encoder, and a lightweight Q-Former [25] for further compressing the VLM representations into only eight embeddings, 1.2% parameters in total. For efficiency, we apply LoRA with a rank of 16 and share parameters across all layers of the Q-Former. The whole process is formulated as: $H^{(0)} = q$, $H^{(\ell)} = \mathrm{TransformerLayer}^{(\ell)}\big(H^{(\ell-1)}, E_t\big)$, $V_t = H^{(L)}$ (1). We also empirically find the training converges fast, and a single epoch is sufficient to achieve strong performance.
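
The recurrence quoted in the Experiment Setup row can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that q holds the eight query embeddings, that E_t is the memory encoder's output, and that V_t is the compressed memory read off after the last layer. The names `cross_attend` and `compress`, the layer count, and the dimensions are hypothetical. The layer is a parameter-free cross-attention with a residual connection; reusing the same function at every step stands in for sharing parameters across layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(H, E_t):
    """Toy layer: each row of H attends over the rows of E_t, plus a residual.

    Assumption: a real Q-Former layer also has learned projections,
    self-attention, and a feed-forward block; all are omitted here.
    """
    d = len(H[0])
    out = []
    for h in H:
        scores = [sum(a * b for a, b in zip(h, e)) / math.sqrt(d) for e in E_t]
        weights = softmax(scores)
        context = [sum(w * e[i] for w, e in zip(weights, E_t)) for i in range(d)]
        out.append([hi + ci for hi, ci in zip(h, context)])
    return out


def compress(q, E_t, num_layers):
    """Apply H(0) = q, H(l) = layer(H(l-1), E_t), and return V_t = H(L)."""
    H = [row[:] for row in q]      # H(0) = q (copied so q is not modified)
    for _ in range(num_layers):    # same layer each step: shared parameters
        H = cross_attend(H, E_t)
    return H                       # V_t = H(L)


# Hypothetical sizes: 8 queries, 50 encoder states, hidden width 4, 3 layers.
random.seed(0)
q = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
E_t = [[random.gauss(0, 1) for _ in range(4)] for _ in range(50)]
V_t = compress(q, E_t, num_layers=3)
print(len(V_t), len(V_t[0]))  # 8 4
```

However many encoder states go in, the output keeps the shape of q, which is the sense in which the representation is compressed to eight embeddings.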