Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm

Authors: Jinrong Zhang, Penghui Wang, Chunxiao Liu, Wei Liu, Dian Jin, Qiong Zhang, Erli Meng, Zhengnan Hu

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on public datasets, showing that MI Grounding achieves competitive performance on OSOD and OSS benchmarks compared to text prompt paradigm methods and visual prompt paradigm methods. Moreover, MI Grounding can greatly outperform existing methods on our constructed specialized ADR50K dataset.
Researcher Affiliation | Collaboration | Jinrong Zhang^2*, Penghui Wang^1, Chunxiao Liu^1, Wei Liu^1, Dian Jin^1, Qiong Zhang^1, Erli Meng^1, Zhengnan Hu^1; ^1 Xiaomi AI Lab, Beijing, China; ^2 Dalian University of Technology, Dalian, China. EMAIL, EMAIL
Pseudocode | No | The paper includes architectural diagrams (Figures 2 and 4) and mathematical formulas, but does not contain explicitly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | No | The paper does not contain an explicit statement or a direct link indicating the release of source code for the methodology described.
Open Datasets | Yes | We conduct extensive experiments on public datasets, showing that MI Grounding achieves competitive performance on OSOD and OSS benchmarks compared to text prompt paradigm methods and visual prompt paradigm methods. In MI Grounding-S, we use only the COCO (Lin et al. 2014) and LVIS (Gupta, Dollar, and Girshick 2019) datasets for joint training and test on the COCO, ADE20K (Zhou et al. 2017), and SegInW (Zou et al. 2023) datasets. In MI Grounding-D, we use only the Objects365 (Shao et al. 2019) dataset for training and test on the COCO, LVIS, and ODinW (Li et al. 2022) datasets.
Dataset Splits | Yes | We allocated approximately 10% of the data to the test set, with the remaining data used for the training set.
Hardware Specification | Yes | It's important to note that we train MI Grounding-D on Objects365 for 32 A100 days and MI Grounding-S on COCO+LVIS for 16 A100 days.
Software Dependencies | No | The paper mentions using 'ViT-L as the vision backbone' but does not specify any programming languages or libraries with version numbers, which are required for software dependencies.
Experiment Setup | Yes | In both MI Grounding-S and MI Grounding-D, we use ViT-L as the vision backbone. We use 8 as the number of image prompts in our method, as discussed in the ablation study. During training, we randomly sample N cropped instance images as image prompts for each category, updating them every iteration. Finally, we set the model to update the image prompts once per iteration. The loss function L of MI Grounding consists of a classification loss L_class, localization losses L_L1 and L_GIoU, and a segmentation loss L_mask. For the classification loss, we use a contrastive loss (Radford et al. 2021). For the localization loss, we apply L1 loss (Ren et al. 2015) for regressing the bounding box coordinates and GIoU loss (Rezatofighi et al. 2019) to enhance convergence stability. In the segmentation loss, L_mask is a cross-entropy loss for mask segmentation.
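The roughly 90/10 train/test split reported under "Dataset Splits" can be sketched as below. This is an illustrative Python helper, not the authors' code; the paper excerpt does not say whether the data was shuffled or how, so the shuffle and seed here are assumptions.

```python
import random

def split_dataset(items, test_frac=0.10, seed=0):
    """Hold out ~test_frac of items as a test set; the rest is training data.

    Shuffling with a fixed seed is an assumption for reproducibility; the
    paper only states that approximately 10% of the data went to the test set.
    """
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = max(1, round(len(items) * test_frac))
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test
```

For 50k annotated images (the scale suggested by the ADR50K name), this would place about 5k in the test set.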
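The localization objective in the experiment setup combines an L1 regression term with a GIoU term (Rezatofighi et al. 2019). A minimal NumPy sketch of those two box losses follows; the function names and the equal loss weights are assumptions, since the excerpt does not give the weighting, and this is an illustration rather than the authors' implementation.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error over the four box coordinates, per box."""
    return np.abs(pred - target).mean(axis=1)

def giou_loss(pred, target):
    """1 - GIoU for axis-aligned boxes given as (x1, y1, x2, y2) rows."""
    # Intersection area (zero when the boxes are disjoint).
    ix1 = np.maximum(pred[:, 0], target[:, 0])
    iy1 = np.maximum(pred[:, 1], target[:, 1])
    ix2 = np.minimum(pred[:, 2], target[:, 2])
    iy2 = np.minimum(pred[:, 3], target[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union
    # Smallest enclosing box; the GIoU penalty grows as the boxes separate.
    ex1 = np.minimum(pred[:, 0], target[:, 0])
    ey1 = np.minimum(pred[:, 1], target[:, 1])
    ex2 = np.maximum(pred[:, 2], target[:, 2])
    ey2 = np.maximum(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou

def localization_loss(pred, target, w_l1=1.0, w_giou=1.0):
    """Weighted L1 + GIoU localization term; equal weights are assumed."""
    return w_l1 * l1_loss(pred, target) + w_giou * giou_loss(pred, target)
```

Unlike plain IoU, the GIoU term stays informative for non-overlapping boxes: two disjoint boxes receive a loss above 1, so gradients still pull them together.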