Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Expand VSR Benchmark for VLLM to Expertize in Spatial Rules
Authors: Peijin Xie, Lin Sun, Bingquan Liu, Dexin Wang, Xiangzheng Zhang, Chengjie Sun, Jiajia Zhang
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved over a 27% increase in accuracy on the VSR test set. |
| Researcher Affiliation | Collaboration | 1 Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China; 2 360 Search Department, Beijing, 100020, China; 3 School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, 518055, China |
| Pseudocode | No | The paper describes methodologies in paragraph text and uses figures to illustrate concepts (e.g., Figure 1 showing the overall expansion method) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Extended version https://github.com/peijin360/vsre |
| Open Datasets | Yes | Traditional classification benchmark VSR (Liu, Emerson, and Collier 2023) has proposed a controlled probing dataset... During the image data selection from MSCOCO (Lin et al. 2014)... Existing hot benchmarks like MME (Fu et al. 2024), MMBench (Liu et al. 2023b), SEED (Li et al. 2023) test VLLMs' various capabilities |
| Dataset Splits | Yes | Testing Datasets... (1) Test-G randomly sampled a prompt from the 50-template pool for each triplet... (2) Test-S froze the template to the specific one ([caption], True or false.)... Excluding the test set, we collected more than 10k triplets with images from the original VSR dataset as a seed. Then we expanded it several dozen times into pre-train and IFT data as follows:... Pre-training data: Under the three settings with a ratio of 5:3:2, we repainted the original images, expanding the quantity 20 to 100 times the original amount. We label the set as pre-100k for 100k pre-training data and pre-500k for 500k. IFT data: We used 50 general prompt templates (30 manual and 20 GPT4-generated) to expand the 11k triplet data nearly 50 times to 500k, then named it turn-g 500k. |
| Hardware Specification | No | The paper mentions various models like LLaVA-1.5 7B and 13B, Vicuna, LLaMA2, LLaMA3, Qwen-VL, BLIP2, and InstructBLIP, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions several models and frameworks like CLIP, SigLIP, DINOv2, SAM, SDXL, GPT-4o, and spaCy, often with citations, but it does not specify explicit version numbers for these or any other software libraries used for implementation (e.g., PyTorch 1.x, spaCy 3.x). |
| Experiment Setup | No | During the pre-training stage, we froze the LLM and only trained the adapter layers... In the fine-tuning stage, we unfroze the LLM, allowing it to participate in the training... Finally, during the inference stage, we limited the length of the model's responses to only one new word. This word was used as the answer to binary questions... Any other responses were directly judged as incorrect. The paper describes the general training process and inference constraints but lacks specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings. |
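The inference constraint quoted in the Experiment Setup row (a single new word used as the binary answer, with anything else judged incorrect) can be sketched as a small scoring helper. This is a minimal sketch of the described rule; the function names and the exact "True"/"False" matching are assumptions, not code from the paper.

```python
def judge_response(response: str, gold: bool) -> bool:
    """Judge a one-word model response against the gold True/False label.

    Per the quoted setup, only a single word is accepted; any response
    other than an exact "True" or "False" is directly judged incorrect.
    (Hypothetical helper; exact matching rules are an assumption.)
    """
    word = response.strip()
    if word not in ("True", "False"):
        return False  # non-binary responses count as wrong
    return (word == "True") == gold


def accuracy(responses, golds):
    """Fraction of correctly judged responses over a test set."""
    judged = [judge_response(r, g) for r, g in zip(responses, golds)]
    return sum(judged) / len(judged)
```

Under this rule, strict answer formatting is part of the metric: a model that answers "yes" instead of "True" is scored as incorrect even if the intended meaning is right.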