Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Stitch and Tell: A Structured Data Augmentation Method for Spatial Understanding

Authors: Hang Yin, Xiaomin He, Peiwen Yuan, Yiwei Li, Jiayi Shi, Wenxiao Fan, Shaoxiong Feng, Kan Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SiTe across three architectures including LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B and HALVA-7B, two training datasets, and thirteen benchmarks. Experiments show that SiTe improves spatial understanding tasks such as MMEPosition (+5.50%) and Spatial-MM (+4.19%), while maintaining or improving performance on general vision-language benchmarks.
Researcher Affiliation | Collaboration | Hang Yin (1), Xiaomin He (2), Peiwen Yuan (1), Yiwei Li (1), Jiayi Shi (1), Wenxiao Fan (1), Shaoxiong Feng (3), Kan Li (1). (1) School of Computer Science, Beijing Institute of Technology; (2) School of Software and Microelectronics, Peking University; (3) Xiaohongshu Inc.
Pseudocode | Yes | The pseudocode of the image stitching process is illustrated in Algorithm 1.
Open Source Code | No | We will release the data and code soon.
Open Datasets | Yes | We investigated existing large-scale multi-modal datasets, such as Conceptual Captions [28], COCO [18], VQA [1], and SBU Captions [24], and observed only a small fraction of samples contain explicit spatial information (see Table 1). As shown in Figure 1, the spatially-aware data means the samples whose captions include clear spatial information (e.g., "to the left of", "in front of"). During training, the model aligns visual content with limited linguistic descriptions, which may constrain its ability to capture latent spatial structures from the image alone.
Dataset Splits | Yes | By default, the stitched data make up one-third of the total set. To avoid duplicate supervision, image-caption pairs used for stitching are removed from the raw set, ensuring each image appears only once. As a result, the final number of training samples is slightly smaller than 558K. We design 35 templates for horizontal stitching and 29 templates for vertical stitching. For each sample, a template is randomly selected from the corresponding set based on the stitching mode. This diversity in spatial phrasing encourages the model to learn spatial relations in a more flexible and robust manner, rather than relying on fixed linguistic patterns. To further compare SiTe-augmented data with existing spatial supervision data, we construct a pretraining variant by substituting part of the original image-caption data with an equal-sized set of spatially focused samples from the Visual Spatial Reasoning (VSR) dataset [19]. In the default setting, the number of VSR data is 5K. Additionally, we compare SiTe with two standard augmentation baselines: Rotate and Crop.
Hardware Specification | Yes | For the pretraining stage, we train LLaVA-v1.5-7B and LLaVA-Qwen2-1.5B using a batch size of 16 and 64 per GPU, respectively, on 8 L20Z GPUs.
Software Dependencies | No | Noun phrases are extracted from original sentences using Qwen2.5-72B-Instruct with the prompt: "Extract the concrete, visible physical objects or entities described in this sentence. Return a comma-separated list. Ignore abstract terms like 'type', 'color', 'time', or actions." Based on the extracted entities, we construct spatial question-answer pairs to form a new instruction-tuning dataset. These templates are automatically produced using a language model and then curated to ensure naturalness and correctness. Templates were automatically generated with instruction-tuned models and then filtered to preserve clarity, directional accuracy, and grammatical diversity.
Experiment Setup | Yes | For the pretraining stage, we train LLaVA-v1.5-7B and LLaVA-Qwen2-1.5B using a batch size of 16 and 64 per GPU, respectively, on 8 L20Z GPUs. All pretraining experiments are conducted for 1 epoch following the original LLaVA setup. For the supervised fine-tuning stage, we use the same hardware setup and follow the original LLaVA configuration, training both models for 1 epoch. The batch sizes are set to 4, 16 and 128 per GPU for HALVA, LLaVA-v1.5-7B and LLaVA-Qwen2-1.5B, respectively. We adopt the same learning rate and weight decay as in the original model settings. In all ablation experiments, the batch size and training schedule also remain consistent, and only the ratio of stitched data is varied.
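
The Pseudocode and Dataset Splits rows describe stitching image-caption pairs horizontally or vertically and then filling a randomly chosen template for that stitching mode. The Python sketch below illustrates that idea only; it is not the authors' Algorithm 1, which this summary does not reproduce. The function name `stitch_pair`, the template wording, the nested-list image representation, and the crop-to-common-size step are all assumptions made for illustration.

```python
import random

# Hypothetical templates. The paper reports 35 horizontal and 29 vertical
# templates; their actual wording is not given in this summary.
HORIZONTAL_TEMPLATES = [
    "On the left: {a} On the right: {b}",
    "The left image shows: {a} The right image shows: {b}",
]
VERTICAL_TEMPLATES = [
    "At the top: {a} At the bottom: {b}",
    "The upper image shows: {a} The lower image shows: {b}",
]


def stitch_pair(img_a, cap_a, img_b, cap_b, mode, rng=random):
    """Stitch two images and build a caption that states their layout.

    Images are plain lists of pixel rows (a stand-in for real image arrays).
    """
    if mode == "horizontal":
        # Assumption: crop to the smaller height so the rows line up.
        rows = min(len(img_a), len(img_b))
        image = [img_a[r] + img_b[r] for r in range(rows)]
        templates = HORIZONTAL_TEMPLATES
    elif mode == "vertical":
        # Assumption: crop to the smaller width so the columns line up.
        width = min(len(img_a[0]), len(img_b[0]))
        image = [row[:width] for row in img_a] + [row[:width] for row in img_b]
        templates = VERTICAL_TEMPLATES
    else:
        raise ValueError(f"unknown stitching mode: {mode!r}")

    # A template is drawn at random from the set matching the stitching mode.
    caption = rng.choice(templates).format(a=cap_a, b=cap_b)
    return image, caption
```

Passing a seeded `random.Random` instance as `rng` makes the template choice repeatable, which may help when regenerating the same augmented set.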