Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion

Authors: Zhiwei Lin, Yongtao Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects.
Researcher Affiliation | Academia | Zhiwei Lin Yongtao Wang Wangxuan Institute of Computer Technology, Peking University, China EMAIL
Pseudocode | No | The paper describes the methodology in prose and diagrams (Figures 1, 2, 3) but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | We do not provide new datasets and will release partial code after the paper is accepted.
Open Datasets | Yes | Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects. We mainly evaluate the proposed method on the LVIS dataset, which contains 1203 categories. We adopt the fixed AP [8] as the evaluation metric on frequent, common, and rare classes.
Dataset Splits | Yes | We mainly evaluate the proposed method on the LVIS dataset, which contains 1203 categories. We adopt the fixed AP [8] as the evaluation metric on frequent, common, and rare classes. [...] We report fixed AP [8] on LVIS val and minival [13].
Hardware Specification | Yes | All training can be done on 8 NVIDIA A800 GPUs within two days.
Software Dependencies | No | The paper mentions specific models and frameworks used (e.g., InternVL-2.5-8B, LLMDet, MMDetection) but does not provide specific version numbers for underlying software libraries or programming languages.
Experiment Setup | Yes | For VL-SAM, we choose InternVL-2.5-8B with InternViT-300M [6] and InternLM2.5-7B [5] as the vision-language model. We set the temperature to 0.8 and top-p for nucleus sampling to 0.8 for InternVL-2.5-8B. For the open-set model, we select LLMDet [11] as the baseline model because of its SOTA performance. The number N of additional learnable queries is set to 900. For denoising points, both hyper-parameters λ1 and λ2 are set to 1. The whole model of VL-SAM-V2 is fine-tuned with GroundingCap-1M dataset [11] following the training protocol of LLMDet [11]. During training, only the self-attention modules in general and specific query fusion and box heads are fine-tuned, while others are frozen. We fine-tune VL-SAM-V2 for 150k iterations using automatic mixed-precision with a batch size of 16.
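
The settings quoted in the Experiment Setup row can be collected in one place, and the temperature and top-p values can be illustrated with a generic nucleus-sampling filter. The Python sketch below is only an illustration and is not taken from the paper or its code: the class name ReportedSetup, its field names, and the function nucleus_distribution are hypothetical, and the filter is the textbook form of temperature-scaled nucleus sampling, which may differ in detail from what InternVL-2.5-8B's decoding actually does.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; all names are hypothetical."""
    temperature: float = 0.8           # sampling temperature for the vision-language model
    top_p: float = 0.8                 # nucleus sampling threshold
    num_learnable_queries: int = 900   # N additional learnable queries
    lambda1: float = 1.0               # denoising-point hyper-parameter
    lambda2: float = 1.0               # denoising-point hyper-parameter
    iterations: int = 150_000          # fine-tuning iterations
    batch_size: int = 16               # reported batch size

    def samples_seen(self) -> int:
        """Training samples processed over the whole run."""
        return self.iterations * self.batch_size


def nucleus_distribution(logits, temperature, top_p):
    """Generic top-p filter: keep the smallest set of highest-probability
    tokens whose cumulative mass reaches top_p, then renormalize."""
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least probable until the mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Tokens outside the nucleus get zero probability.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


setup = ReportedSetup()
toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up logits, for illustration only
dist = nucleus_distribution(toy_logits, setup.temperature, setup.top_p)
```

With these toy logits, the two most probable tokens already cover more than 80% of the probability mass, so the remaining two are dropped before sampling. Under the reported settings, samples_seen() gives 150,000 × 16 = 2,400,000 training samples.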