Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Expand VSR Benchmark for VLLM to Expertize in Spatial Rules
Authors: Peijin Xie, Lin Sun, Bingquan Liu, Dexin Wang, Xiangzheng Zhang, Chengjie Sun, Jiajia Zhang
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved over a 27% increase in accuracy on the VSR test set. |
| Researcher Affiliation | Collaboration | 1 Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China; 2 360 Search Department, Beijing, 100020, China; 3 School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, 518055, China |
| Pseudocode | No | The paper describes methodologies in paragraph text and uses figures to illustrate concepts (e.g., Figure 1 showing the overall expansion method) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Extended version https://github.com/peijin360/vsre |
| Open Datasets | Yes | Traditional classification benchmark VSR (Liu, Emerson, and Collier 2023) has proposed a controlled probing dataset... During the image data selection from MSCOCO (Lin et al. 2014)... Existing hot benchmarks like MME (Fu et al. 2024), MMBench (Liu et al. 2023b), SEED (Li et al. 2023) test VLLMs' various capabilities |
| Dataset Splits | Yes | Testing Datasets... (1) Test-G randomly sampled a prompt from the 50-template pool for each triplet... (2) Test-S froze the template to the specific one ([caption], True or false.)... Excluding the test set, we collected more than 10k triplets with images from the original VSR dataset as a seed. Then we expanded it several dozen times into pre-train and IFT data as follows:... Pre-training data: Under the three settings with a ratio of 5:3:2, we repainted the original images, expanding the quantity 20 to 100 times the original amount. We label the set as pre-100k for 100k pre-training data and pre-500k for 500k. IFT data: We used 50 general prompt templates (30 manual and 20 GPT4-generated) to expand the 11k triplet data nearly 50 times to 500k, then named it turn-g 500k. |
| Hardware Specification | No | The paper mentions various models like LLaVA-1.5 7B and 13B, Vicuna, LLaMA2, LLaMA3, Qwen-VL, BLIP2, and InstructBLIP, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions several models and frameworks like CLIP, SigLIP, DINOv2, SAM, SDXL, GPT-4o, and spaCy, often with citations, but it does not specify explicit version numbers for these or any other software libraries used for implementation (e.g., PyTorch 1.x, spaCy 3.x). |
| Experiment Setup | No | During the pre-training stage, we froze the LLM and only trained the adapter layers... In the fine-tuning stage, we unfroze the LLM, allowing it to participate in the training... Finally, during the inference stage, we limited the length of the model's responses to only one new word. This word was used as the answer to binary questions... Any other responses were directly judged as incorrect. The paper describes the general training process and inference constraints but lacks specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings. |
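The inference constraint quoted in the Experiment Setup row (a single new word used as the binary answer, with anything else judged incorrect) can be sketched as a small scoring helper. This is a minimal sketch of the described rule; the function names and the exact "True"/"False" matching are assumptions, not code from the paper.

```python
def judge_response(response: str, gold: bool) -> bool:
    """Judge a one-word model response against the gold True/False label.

    Per the quoted setup, only a single word is accepted; any response
    other than an exact "True" or "False" is directly judged incorrect.
    (Hypothetical helper; exact matching rules are an assumption.)
    """
    word = response.strip()
    if word not in ("True", "False"):
        return False  # non-binary responses count as wrong
    return (word == "True") == gold


def accuracy(responses, golds):
    """Fraction of correctly judged responses over a test set."""
    judged = [judge_response(r, g) for r, g in zip(responses, golds)]
    return sum(judged) / len(judged)
```

Under this rule, strict answer formatting is part of the metric: a model that answers "yes" instead of "True" is scored as incorrect even if the intended meaning is right.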