Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Quantifying Cross-Modality Memorization in Vision-Language Models

Authors: Yuxin Wen, Yangsibo Huang, Tom Goldstein, Ravi Kumar, Badih Ghazi, Chiyuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities.
Researcher Affiliation | Collaboration | Yuxin Wen¹, Yangsibo Huang², Tom Goldstein¹, Ravi Kumar², Badih Ghazi², Chiyuan Zhang²; ¹University of Maryland, College Park; ²Google
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. The paper describes methodologies in paragraph form and uses diagrams to illustrate concepts.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Instead of providing a copy of the data, we provide full details to reproduce the experiments, including instructions on how to generate synthetic datasets used in the experiments.
Open Datasets | Yes | we incorporate images and captions from the COCO dataset [Lin et al., 2014].
Dataset Splits | Yes | The resulting synthetic persona dataset consists of a collection of 100 unique personas. Each persona is characterized by the following elements: A set of 100 image variants for training and 1 distinct image for testing. A set of 100 textual description variants for training and 1 distinct textual description for testing.
Hardware Specification | Yes | All training is performed on a single Nvidia A100-80G GPU.
Software Dependencies | No | The paper mentions fine-tuning 'Gemma-3-4b' and using 'LoRA' and 'AdamW', which are models and algorithms, respectively. However, it does not provide specific version numbers for software dependencies like programming languages, libraries (e.g., PyTorch, TensorFlow), or other frameworks used in the implementation.
Experiment Setup | Yes | During fine-tuning, we utilize LoRA [Hu et al., 2022] with a rank of r = 32, a scaling factor of α = 32, and a dropout probability of 0.05. We use AdamW [Loshchilov and Hutter, 2017] with a learning rate of 2 × 10⁻⁴ and a batch size of 16.
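As a rough illustration of what the Dataset Splits and Experiment Setup rows imply, the sketch below collects the reported values in one place and derives a few quantities from them. It is not the authors' code, since the paper releases none. The class name `ReportedSetup` and its field names are hypothetical. The learning rate is read as 2 × 10⁻⁴. The steps-per-epoch figure assumes that all 100 personas of one modality are used in a single fine-tuning run with no gradient accumulation, which the table does not state.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; class and field names are illustrative."""

    lora_rank: int = 32                     # r
    lora_alpha: int = 32                    # scaling factor alpha
    lora_dropout: float = 0.05
    learning_rate: float = 2e-4             # AdamW learning rate
    batch_size: int = 16
    num_personas: int = 100
    train_variants_per_persona: int = 100   # per modality (image or text)
    test_items_per_persona: int = 1         # per modality

    @property
    def lora_scaling(self) -> float:
        # LoRA scales the low-rank update by alpha / r (Hu et al., 2022).
        return self.lora_alpha / self.lora_rank

    @property
    def train_examples_per_modality(self) -> int:
        return self.num_personas * self.train_variants_per_persona

    @property
    def test_examples_per_modality(self) -> int:
        return self.num_personas * self.test_items_per_persona

    @property
    def steps_per_epoch(self) -> int:
        # Assumption: one optimizer step per batch, no gradient accumulation,
        # and every training example of one modality seen once per epoch.
        return math.ceil(self.train_examples_per_modality / self.batch_size)


setup = ReportedSetup()
print("LoRA scaling (alpha / r):", setup.lora_scaling)
print("Training examples per modality:", setup.train_examples_per_modality)
print("Test examples per modality:", setup.test_examples_per_modality)
print("Steps per epoch (under the stated assumption):", setup.steps_per_epoch)
```

With r = α = 32 the LoRA update is applied at a scale of 1, and each modality contributes 100 × 100 = 10,000 training items and 100 held-out test items.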