Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VLMs can Aggregate Scattered Training Patches
Authors: Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each (image, ID) pair into {(patch, ID)} pairs at different granularities for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text reference. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. |
| Researcher Affiliation | Academia | Zhanhui Zhou¹, Lingjie Chen², Chao Yang, Chaochao Lu. ¹UC Berkeley, ²University of Illinois Urbana-Champaign |
| Pseudocode | No | The paper describes methodologies and experimental procedures in narrative text and figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ZHZisZZ/visual-stitching. |
| Open Datasets | Yes | Animal images come from ImageNet [26], food images from Food101 [27], and landmark images from Pexels, a stock photography site (see Appendix A.1 for dataset details). |
| Dataset Splits | No | We split source datasets into patch-text sets P_f = {(patch, ID)} using split factors of f ∈ {1, 2, 4, 8}, then finetune VLMs on these sets. Empirically, to help VLMs better internalize the finetuned knowledge, we provide context by formatting the ID with the template "[patch]The food/animal/landmark shown in the image is associated with ID {ID}", where [patch] is a placeholder for visual input from patches. Unless otherwise specified, loss is computed only on the target {ID}. The paper describes how images are divided into patches for finetuning, but does not specify traditional train/validation/test splits with percentages or counts for the datasets used in experiments. |
| Hardware Specification | No | Table 3: Per-model configurations including DeepSpeed [45] configs and GPUs. This table lists the number of GPUs used for each model but does not specify the type or model of the GPUs (e.g., NVIDIA A100). |
| Software Dependencies | No | We build on the TRL [44] SFTTrainer and its example VLM training script. No specific version numbers for TRL or other key software components like Python, PyTorch, or CUDA are provided. |
| Experiment Setup | Yes | Experiments are run with a batch size of 8 and a learning rate of 1e-5. We finetune for 15 epochs when using full images (i.e., f = 1) and 5 epochs for all other settings. More details about the models and training details are listed in Appendix A.2 and A.3. (Table 2: Hyperparameters: Batch Size 8; Learning Rate 1e-5; Mixed Precision bf16; Epochs 15 if f = 1, 5 otherwise) |
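
The patch-text construction quoted in the Dataset Splits row can be illustrated with a minimal sketch. It is not the authors' code (see the linked repository for that). It assumes that a split factor f cuts each image into an f × f grid of equal patches, which the quoted text does not state explicitly. It also represents images as nested lists of pixel values so the example needs no image library. The function names `split_into_patches`, `build_patch_text_set`, and `epochs_for` are hypothetical. The epoch rule mirrors the Experiment Setup row.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def split_into_patches(image: Sequence[Sequence[int]], f: int) -> List[Image]:
    """Cut an image into an f x f grid of patches, in row-major order.

    Assumption: the split factor applies to both height and width.
    """
    height, width = len(image), len(image[0])
    if f < 1 or height % f or width % f:
        raise ValueError("f must be >= 1 and divide both image dimensions")
    ph, pw = height // f, width // f  # patch height and width
    patches: List[Image] = []
    for r in range(f):
        for c in range(f):
            rows = image[r * ph:(r + 1) * ph]
            patches.append([list(row[c * pw:(c + 1) * pw]) for row in rows])
    return patches


def build_patch_text_set(
    pairs: Sequence[Tuple[Sequence[Sequence[int]], str]], f: int
) -> List[Tuple[Image, str]]:
    """Turn (image, ID) pairs into (patch, ID) pairs; each patch keeps its source ID."""
    return [
        (patch, image_id)
        for image, image_id in pairs
        for patch in split_into_patches(image, f)
    ]


def epochs_for(f: int) -> int:
    """Epoch count reported in the table: 15 for full images (f = 1), 5 otherwise."""
    return 15 if f == 1 else 5


if __name__ == "__main__":
    # A 4 x 4 toy image whose pixel values equal their row-major index.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    for f in (1, 2, 4):
        dataset = build_patch_text_set([(image, "ID-0001")], f)
        print(f, len(dataset), epochs_for(f))
```

Under this assumption, a dataset of N images yields N · f² training pairs at split factor f, and f = 1 reduces to training on the full images.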