Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VLMs can Aggregate Scattered Training Patches

Authors: Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each (image, ID) pair into {(patch, ID)} pairs at different granularities for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text reference. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks.
Researcher Affiliation	Academia	Zhanhui Zhou¹, Lingjie Chen², Chao Yang, Chaochao Lu. ¹UC Berkeley, ²University of Illinois Urbana-Champaign
Pseudocode	No	The paper describes methodologies and experimental procedures in narrative text and figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	Code is available at https://github.com/ZHZisZZ/visual-stitching.
Open Datasets	Yes	Animal images come from ImageNet [26], food images from Food101 [27], and landmark images from Pexels, a stock photography site (see Appendix A.1 for dataset details).
Dataset Splits	No	We split source datasets into patch-text sets Pf = {(patch, ID)} using split factors of f ∈ {1, 2, 4, 8}, then finetune VLMs on these sets. Empirically, to help VLMs better internalize the finetuned knowledge, we provide context by formatting the ID with the template [patch]The food/animal/landmark shown in the image is associated with ID {ID}, where [patch] is a placeholder for visual input from patches. Unless otherwise specified, loss is computed only on the target {ID}. The paper describes how images are divided into patches for finetuning, but does not specify traditional train/validation/test splits with percentages or counts for the datasets used in experiments.
Hardware Specification	No	Table 3: Per-model configurations including DeepSpeed [45] configs and GPUs. This table lists the number of GPUs used for each model but does not specify the type or model of the GPUs (e.g., NVIDIA A100).
Software Dependencies	No	We build on the TRL [44] SFTTrainer and its example VLM training script. No specific version numbers for TRL or other key software components like Python, PyTorch, or CUDA are provided.
Experiment Setup	Yes	Experiments are run with a batch size of 8 and a learning rate of 1e-5. We finetune for 15 epochs when using full images (i.e., f = 1) and 5 epochs for all other settings. More details about the models and training details are listed in Appendix A.2 and A.3. (Table 2: Hyperparameters - Batch Size 8, Learning Rate 1e-5, Mixed Precision bf16, Epochs 15 if f = 1, 5 otherwise)
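
The patch-text construction quoted under Dataset Splits can be illustrated with a minimal sketch. This is not the authors' code (that lives in the repository linked under Open Source Code). It assumes that a split factor f means an f x f grid of equal patches, which the quoted text does not state explicitly, and it uses a toy image stored as nested lists so the example has no dependencies. The names `split_into_patches`, `build_patch_text_set`, and `TEMPLATE` are hypothetical.

```python
from typing import List, Tuple

# Toy grayscale image: a list of rows, each row a list of pixel values.
Image = List[List[int]]

# Text template quoted in the table; {category} and {id} are filled per image.
TEMPLATE = "The {category} shown in the image is associated with ID {id}"


def split_into_patches(image: Image, f: int) -> List[Image]:
    """Split an image into an f x f grid of patches, in row-major order.

    Assumption: a split factor f yields f * f equal patches.
    """
    height, width = len(image), len(image[0])
    if f < 1 or height % f or width % f:
        raise ValueError("image sides must be divisible by f in this sketch")
    patch_h, patch_w = height // f, width // f
    patches = []
    for r in range(f):
        rows = image[r * patch_h:(r + 1) * patch_h]
        for c in range(f):
            patches.append([row[c * patch_w:(c + 1) * patch_w] for row in rows])
    return patches


def build_patch_text_set(
    image: Image, image_id: str, category: str, f: int
) -> List[Tuple[Image, str]]:
    """Pair every patch of one image with the same ID-bearing text."""
    text = TEMPLATE.format(category=category, id=image_id)
    return [(patch, text) for patch in split_into_patches(image, f)]
```

With f = 1 the set contains the full image only, which corresponds to the 15-epoch setting in the Experiment Setup row; every other split factor produces f * f pairs that all share one ID string.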