Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
Authors: Siting Li, Xiang Gao, Simon S Du
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate current retrievers on handling attribute-focused queries, we build COCO-FACET, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference. |
| Researcher Affiliation | Academia | Siting Li, University of Washington; Xiang Gao, IIIS, Tsinghua University; Simon Shaolei Du, University of Washington |
| Pseudocode | No | The paper describes methods like promptable image embeddings and linear approximation but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | https://github.com/lst627/COCO-Facet |
| Open Datasets | Yes | To evaluate current retrievers on handling attribute-focused queries, we build COCO-FACET, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. ... We utilize the existing annotations provided by MSCOCO [Lin et al., 2014], Visual7W [Zhu et al., 2016], VisDial [Das et al., 2017], and COCO-Stuff [Caesar et al., 2018] about COCO images. ... The link for downloading our COCO-FACET benchmark is attached in the main text... |
| Dataset Splits | Yes | We use the validation set of MSCOCO 2017 for comparison after converting it to the same format ("Find me an everyday image that matches the given caption." + COCO caption as the query text, and 100 candidate images). ... All samples were sourced from the val2017 split of the COCO dataset. |
| Hardware Specification | Yes | All evaluations of CLIP-family, MagicLens, MLLM-based universal multimodal retrievers, and variants of VLM2Vec can be done using one A6000 GPU with 48GB memory in less than 6 hours per category. ... We evaluate the actual inference cost of our pipeline on an A6000 GPU. |
| Software Dependencies | No | The paper mentions using specific models like CLIP and VLM2Vec, and libraries like FAISS, but does not provide specific version numbers for the ancillary software used in their implementation (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We use GPT-4o [Hurst et al., 2024] to generate questions with the following template: Write a question to ask about the {Attribute Name} in a image, with possible answers such as {A}, {B}, and so on. Please answer in one sentence without mentioning any answer. ... We test the method on VLM2Vec-Phi-3.5-V with K = 100 for each category. The results are shown in Table 6... |
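
The Recall@5 figures quoted in the Research Type row are a top-k retrieval metric. The following is a minimal sketch of how such a score could be computed, not the authors' implementation. It assumes one ground-truth image per query, a single shared candidate pool, and cosine similarity between query and image embeddings. The names `cosine` and `recall_at_k` and the toy vectors are hypothetical and do not come from the paper or its repository.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(
    query_embs: List[Sequence[float]],
    pool_embs: List[Sequence[float]],
    positives: List[int],
    k: int = 5,
) -> float:
    """Fraction of queries whose ground-truth image ranks in the top k.

    positives[i] is the index in pool_embs of the (assumed single)
    correct image for query i.
    """
    hits = 0
    for query, positive in zip(query_embs, positives):
        scores = [cosine(query, candidate) for candidate in pool_embs]
        # Rank candidate indices by descending similarity and keep the top k.
        ranked = sorted(range(len(pool_embs)), key=lambda i: scores[i], reverse=True)
        if positive in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    queries = [[1.0, 0.0], [0.0, 1.0]]
    pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(recall_at_k(queries, pool, positives=[1, 2], k=1))
    print(recall_at_k(queries, pool, positives=[1, 2], k=2))
```

Under these assumptions, a promptable embedding would only change how `pool_embs` is produced (conditioned on the attribute prompt); the ranking and scoring steps stay the same.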