Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

Authors: Ruiqi Wang, Hao Helen Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate by extensive experiments that RESAnything achieves superior performance among zero-shot methods on traditional RES benchmarks such as RefCOCO, RefCOCO+ [78], RefCOCOg [40, 42]. Our method also significantly outperforms existing methods on the recent reasoning segmentation dataset ReasonSeg [25], as well as RES tasks in challenging scenarios involving implicit queries and complex part-level relationships such as those from ABO-Image-ARES. We evaluate RESAnything on the ReasonSeg benchmark (Table 2), where our method achieves state-of-the-art performance of 74.6% gIoU and 72.5% cIoU, surpassing LISA-13B by 17% and SAM4MLLM by 16%. Notably, while LISA variants require fine-tuning on reasoning tasks and GLaMM & SAM4MLLM rely on extensive training data, RESAnything achieves this superior performance without any task-specific training, demonstrating the effectiveness of leveraging MLLMs for deep reasoning.
Researcher Affiliation	Academia	Ruiqi Wang Hao Zhang Simon Fraser University EMAIL
Pseudocode	Yes	Algorithm 1 Grouping and Selection Process
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: Code and data are not included in the submission.
Open Datasets	Yes	Public datasets. Following the most previous works on referring segmentation [25, 12], we evaluate the performance of RESAnything on four public benchmark datasets: RefCOCO, RefCOCO+ [78], RefCOCOg [40, 42] and ReasonSeg [25]. ABO-Image-ARES benchmark. To further evaluate the capability of RESAnything in handling implicit expressions (e.g., part-level materials, features, and functionalities), we establish the ABO-Image-ARES benchmark for complex reasoning segmentation tasks. We build upon the ABO dataset, which contains product listings with rich metadata, images, and 3D models from Amazon.com.
Dataset Splits	Yes	Being a zero-shot method, we directly evaluate on the validation and test sets without any fine-tuning. ABO-Image-ARES benchmark. To further evaluate the capability of RESAnything in handling implicit expressions (e.g., part-level materials, features, and functionalities), we establish the ABO-Image-ARES benchmark for complex reasoning segmentation tasks. Our dataset consists of 2,989 expression-segment pairs: 1,360 with object/part semantic labels, 742 depicting logos/packaging labels, 502 referring to functions/designs, and finally, 385 covering material/style properties. The final dataset consists only of expressions that received strong majority approval (3-1 or 4-0 votes) and demonstrated clear visual grounding in the product images. This rigorous curation process yielded 2,989 referring expressions, each targeting part-level regions and describing specific materials, features, functionalities, or packaging elements.
Hardware Specification	Yes	Our experiments were conducted on a server with 8 NVIDIA 32GB V100 GPUs for parallel inference, but the entire inference process can run effectively on just a single NVIDIA 24GB 4090 GPU.
Software Dependencies	No	We use Pixtral 12B [4] as the MLLM, SAM ViT-H [24] for generating segmentation proposals, and CLIP-ViT-B-32 for CLIP scores.
Experiment Setup	Yes	We configure SAM with sampling points at 0.015% of total image pixels and filter out segments smaller than 0.1% of the image area, preventing over-segmentation while maintaining meaningful region proposals. We set the threshold for final verification step to 1 for all experiments. These optimizations include: 1) utilizing the bfloat16 data format for the LLM, which is not supported on V100; 2) enabling flash attention for more efficient transformer operations; 3) implementing batch generation for LLM outputs rather than sequential processing of each reference and candidate text; and 4) employing batch computation for CLIP similarity scores.
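
The two percentages quoted in the Experiment Setup row can be turned into concrete numbers for a given image size. The following is a minimal sketch, not the authors' code (which is not released): the function names `proposal_budget` and `filter_segments` are hypothetical, and the rounding choices are assumptions. How the resulting point count is passed to SAM is not specified in the quoted text, so the sketch stops at the arithmetic.

```python
import math
from fractions import Fraction

# Fractions taken from the quoted setup: 0.015% and 0.1%.
POINT_FRACTION = Fraction(15, 100_000)    # sampling points per image pixel
MIN_AREA_FRACTION = Fraction(1, 1_000)    # minimum segment area per image pixel


def proposal_budget(width, height):
    """Return (number of sampling points, minimum segment area in pixels).

    Rounding the point count to the nearest integer and rounding the
    minimum area up are assumptions made for this illustration.
    """
    total_pixels = width * height
    n_points = max(1, round(total_pixels * POINT_FRACTION))
    min_area = math.ceil(total_pixels * MIN_AREA_FRACTION)
    return n_points, min_area


def filter_segments(segment_areas, width, height):
    """Keep only segments whose pixel area is at least 0.1% of the image."""
    _, min_area = proposal_budget(width, height)
    return [area for area in segment_areas if area >= min_area]
```

For a 640 x 480 image this gives 46 sampling points and a minimum segment area of 308 pixels, so a 307-pixel segment would be discarded and a 308-pixel one kept.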