Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning

Authors: Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that, enhanced by INST-IT, our models not only achieve outstanding performance on INST-IT Bench and other instance understanding benchmarks, but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our method not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
Researcher Affiliation | Collaboration | 1 Institute of Trustworthy Embodied AI, Fudan University; 2 Shanghai Innovation Institute; 3 Huawei Noah's Ark Lab
Pseudocode | No | The paper describes the methods in narrative text and figures (e.g., Fig. 1, Fig. 2) and provides task prompt templates (Fig. 5, 6, 7), but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The codes, models, dataset, and benchmark will be fully open-sourced. We will release all the codes, data, and models once the blind review period is finished.
Open Datasets | Yes | We utilize five video instance segmentation datasets (BRUST [3, 18], UVO [83], OVIS [65], LVVIS [79] and YouTube-VIS-2021 [89]) and two object tracking datasets (BenSMOT [38], VidOR [75]) as our video sources, as they provide annotations of instance locations, which is useful in SoM visual prompting [88]. For the image source, we select the SA-1B [29] dataset due to its diversity and abundance of instance objects.
Dataset Splits | Yes | To prevent data leakage, we use videos from the test split, ensuring no overlap with INST-IT Dataset. We apply the pipeline in Sec. 2.1 to generate 20 open-ended QA pairs for each image and video. [...] INST-IT Bench comprises 1,000 QA pairs for 338 images and 1,000 QA pairs for 206 videos.
Hardware Specification | Yes | We use 8 H100 for all experiments. The image-video joint training stage takes approximately 20 hours when using Vicuna-7B as the language model and 24 hours using Qwen2-7B with SigLIP-SO400M-384.
Software Dependencies | No | We use LLaVA-NeXT [44] as our baseline due to its widespread adoption. In the default configuration, Vicuna-1.5-7B [16] serves as the language model with CLIP-ViT-336 [67] as the vision encoder. We utilize the AdamW [49] with a cosine learning rate schedule for optimization. [...] Furthermore, we use Qwen2-7B [87] with SigLIP-SO400M-384 [97] for improved performance in our main experiment, and Qwen2-1.5B with CLIP-ViT-336 for efficiency in our ablation study.
Experiment Setup | Yes | We utilize the AdamW [49] with a cosine learning rate schedule for optimization. [...] We limit the maximum number of frames to 32 and the context length of LLMs to 6K due to GPU memory constraints. [...] For single images, we split the original image into up to 4 sub-images based on its aspect ratio following the AnyRes [44] approach, and then concatenate the global image with these sub-images. For multiple images and video inputs, we skip the AnyRes procedure and encode every single image. Additionally, we apply 2×2 spatial pooling to reduce the number of visual tokens for video inputs. [...] In this stage, we freeze the first 12 layers of the vision encoder to mitigate potential distribution shifts caused by visually prompted images.
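
The Experiment Setup entry mentions 2×2 spatial pooling of video tokens, a 32-frame cap, a 6K context length, and a cosine learning rate schedule. The sketch below illustrates how these pieces could interact. It is not the authors' code, and it rests on several assumptions. The function names (pool_2x2, video_token_count, cosine_lr) are hypothetical. Average pooling is assumed, since the quoted text says only "spatial pooling". A 24×24 patch grid per frame is assumed, which would correspond to a 336-pixel input with 14-pixel patches, a patch size the quoted text does not state. The quoted text also gives no learning rate, so cosine_lr takes it as a parameter.

```python
import math


def pool_2x2(grid):
    """Average-pool a 2D grid (list of rows of floats) with a 2x2 window, stride 2.

    Assumption: average pooling; the quoted text does not name the pooling type.
    """
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("grid height and width must be even")
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def video_token_count(num_frames, grid_side, pool=2):
    """Visual tokens for a video after pooling each frame's square patch grid."""
    pooled_side = grid_side // pool
    return num_frames * pooled_side * pooled_side


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# 32 frames, assumed 24x24 patch grid, 2x2 pooling -> 12x12 tokens per frame.
tokens = video_token_count(num_frames=32, grid_side=24)
print(tokens)  # 4608
```

Under these assumptions, a full 32-frame video would occupy 4608 visual tokens, leaving part of a 6K context for text. Without pooling, the same video would need 32 × 576 = 18432 tokens, which is consistent with the stated motivation of reducing visual tokens under GPU memory constraints.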