Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

Authors: Ruiqi Wang, Hao Helen Zhang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate by extensive experiments that RESAnything achieves superior performance among zero-shot methods on traditional RES benchmarks such as RefCOCO, RefCOCO+ [78], RefCOCOg [40, 42]. Our method also significantly outperforms existing methods on the recent reasoning segmentation dataset ReasonSeg [25], as well as RES tasks in challenging scenarios involving implicit queries and complex part-level relationships such as those from ABO-Image-ARES. We evaluate RESAnything on the ReasonSeg benchmark (Table 2), where our method achieves state-of-the-art performance of 74.6% gIoU and 72.5% cIoU, surpassing LISA-13B by 17% and SAM4MLLM by 16%. Notably, while LISA variants require fine-tuning on reasoning tasks and GLaMM & SAM4MLLM rely on extensive training data, RESAnything achieves this superior performance without any task-specific training, demonstrating the effectiveness of leveraging MLLMs for deep reasoning.
Researcher Affiliation Academia Ruiqi Wang Hao Zhang Simon Fraser University EMAIL
Pseudocode Yes Algorithm 1 Grouping and Selection Process
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: Code and data are not included in the submission.
Open Datasets Yes Public datasets. Following the most previous works on referring segmentation [25, 12], we evaluate the performance of RESAnything on four public benchmark datasets: RefCOCO, RefCOCO+ [78], RefCOCOg [40, 42] and ReasonSeg [25]. ABO-Image-ARES benchmark. To further evaluate the capability of RESAnything in handling implicit expressions (e.g., part-level materials, features, and functionalities), we establish the ABO-Image-ARES benchmark for complex reasoning segmentation tasks. We build upon the ABO dataset, which contains product listings with rich metadata, images, and 3D models from Amazon.com.
Dataset Splits Yes Being a zero-shot method, we directly evaluate on the validation and test sets without any fine-tuning. ABO-Image-ARES benchmark. To further evaluate the capability of RESAnything in handling implicit expressions (e.g., part-level materials, features, and functionalities), we establish the ABO-Image-ARES benchmark for complex reasoning segmentation tasks. Our dataset consists of 2,989 expression-segment pairs: 1,360 with object/part semantic labels, 742 depicting logos/packaging labels, 502 referring to functions/designs, and finally, 385 covering material/style properties. The final dataset consists only of expressions that received strong majority approval (3-1 or 4-0 votes) and demonstrated clear visual grounding in the product images. This rigorous curation process yielded 2,989 referring expressions, each targeting part-level regions and describing specific materials, features, functionalities, or packaging elements.
Hardware Specification Yes Our experiments were conducted on a server with 8 NVIDIA 32GB V100 GPUs for parallel inference, but the entire inference process can run effectively on just a single NVIDIA 24GB 4090 GPU.
Software Dependencies No We use Pixtral 12B [4] as the MLLM, SAM ViT-H [24] for generating segmentation proposals, and CLIP-ViT-B-32 for CLIP scores.
Experiment Setup Yes We configure SAM with sampling points at 0.015% of total image pixels and filter out segments smaller than 0.1% of the image area, preventing over-segmentation while maintaining meaningful region proposals. We set the threshold for final verification step to 1 for all experiments. These optimizations include: 1) utilizing the bfloat16 data format for the LLM, which is not supported on V100; 2) enabling flash attention for more efficient transformer operations; 3) implementing batch generation for LLM outputs rather than sequential processing of each reference and candidate text; and 4) employing batch computation for CLIP similarity scores.
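To make the two SAM percentages quoted in the Experiment Setup row concrete, the following is a minimal sketch of how they could translate into absolute numbers for a given image size. The function names `sam_proposal_budget` and `keep_large_segments` are hypothetical and do not come from the paper or from the SAM library; rounding to the nearest integer is also an assumption, since the paper excerpt does not say how fractional values are handled.

```python
def sam_proposal_budget(width: int, height: int) -> tuple[int, int]:
    """Return (number of sampling points, minimum segment area in pixels).

    Hypothetical helper: the percentages are the ones quoted in the
    Experiment Setup row; the rounding rule is an assumption.
    """
    pixels = width * height
    # Sampling points at 0.015% of total image pixels.
    num_points = round(pixels * 15 / 100_000)
    # Segments smaller than 0.1% of the image area are filtered out.
    min_area = round(pixels / 1_000)
    return num_points, min_area


def keep_large_segments(segment_areas: list[int], min_area: int) -> list[int]:
    """Keep only segment areas that reach the minimum area (hypothetical helper)."""
    return [area for area in segment_areas if area >= min_area]


if __name__ == "__main__":
    points, min_area = sam_proposal_budget(1000, 1000)
    print(points, min_area)
    print(keep_large_segments([500, 1000, 2500], min_area))
```

Under these assumptions, a 1000 x 1000 image would get 150 sampling points and a minimum segment area of 1,000 pixels, so a 500-pixel segment would be dropped while 1,000- and 2,500-pixel segments would be kept.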