Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Authors: Shiting (Ginny) Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, João Carreira, Priyadarshini Panda

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.
Researcher Affiliation | Collaboration | Shiting Xiao (Yale University); Rishabh Kabra (Google DeepMind); Yuhang Li (Yale University); Donghyun Lee (Yale University); João Carreira (Google DeepMind); Priyadarshini Panda (Yale University)
Pseudocode | No | The paper describes the methodology in prose and with diagrams (e.g., Figure 4) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/GinnyXiao/OpenWorldSAM.
Open Datasets | Yes | We train OpenWorldSAM on the COCO2017-Stuff [46] dataset with panoptic annotations following X-Decoder [15]. The training set contains 104k images. We evaluate the model in a zero-shot setting on eight segmentation tasks across five diverse datasets: ADE20K-150/857 [22], PASCAL VOC-20 [50], PASCAL Context-59/459 [51], ScanNet-20/40 [52], and SUN-RGBD-37 [53]. Evaluation metrics include panoptic quality (PQ), mean average precision (mAP), and mean intersection-over-union (mIoU), corresponding to panoptic, instance, and semantic segmentation tasks, respectively. For referring segmentation, we finetune on RefCOCOg UMD training split [23].
Dataset Splits | Yes | We train OpenWorldSAM on the COCO2017-Stuff [46] dataset with panoptic annotations following X-Decoder [15]. The training set contains 104k images. For referring-expression segmentation we fine-tune the pre-trained checkpoint on RefCOCOg UMD training split for 10 epochs. We evaluate the model in a zero-shot setting on eight segmentation tasks across five diverse datasets.
Hardware Specification | Yes | It is trained for 25 epochs on COCO-Stuff using the AdamW optimizer with a learning rate of 1e-4, batch size 8, on a single NVIDIA A100 GPU. Training is conducted on a single NVIDIA A100 (80 GB) GPU with a batch size of 8. In Table 13 and 14, we present inference timing breakdowns for processing a single 1024 × 1024 image on an NVIDIA A5000 GPU, averaged over five independent runs.
Software Dependencies | No | We implement our model in PyTorch. We initialize the visual model with the weights of SAM2-Hiera-Large [18] and the VLM encoder with the weights of EVF-SAM BEIT-3Large [43]. It is trained for 25 epochs on COCO-Stuff using the AdamW optimizer with a learning rate of 1e-4, batch size 8, on a single NVIDIA A100 GPU. Image resolution is set to 1024 for SAM2 and 224 for BEiT-3. Number of positional tie-breaks is set to 20 for COCO dataset. Our implementation details can be found in Appendix A.
Experiment Setup | Yes | It is trained for 25 epochs on COCO-Stuff using the AdamW optimizer with a learning rate of 1e-4, batch size 8, on a single NVIDIA A100 GPU. Image resolution is set to 1024 for SAM2 and 224 for BEiT-3. Number of positional tie-breaks is set to 20 for COCO dataset. We use the panoptic annotations, which provide pixel-accurate masks and category labels for all 132 thing and stuff classes. Training is conducted on a single NVIDIA A100 (80 GB) GPU with a batch size of 8. Optimization employs AdamW (learning rate 1e-4). A step decay scheduler drops the learning rate by a factor of 0.1 at 89% and 96% of the total iterations. For referring-expression segmentation we fine-tune the pre-trained checkpoint on RefCOCOg UMD training split for 10 epochs. Because images from RefCOCOg were seen during pre-training (with category labels substituted for referring expressions ground truth), we adopt a conservative learning rate of 1e-5. We use a batch size of 8 during training.
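
The step decay schedule quoted in the Experiment Setup row can be illustrated with a short, dependency-free Python sketch. It is not taken from the authors' code. The function name `step_decay_lr`, the total iteration count (derived from the approximate 104k images, batch size 8, and 25 epochs), the floor rounding of the milestones, and the choice to apply each drop at the milestone iteration itself are all assumptions made for illustration.

```python
def step_decay_lr(step, total_steps, base_lr=1e-4, gamma=0.1, milestones_pct=(89, 96)):
    """Learning rate at iteration `step` under a two-milestone step decay.

    Assumption: a drop takes effect at the milestone iteration itself, and
    milestones are floored to whole iterations.
    """
    # Integer arithmetic avoids floating-point error when locating milestones.
    milestones = [total_steps * pct // 100 for pct in milestones_pct]
    # Count how many milestones have been reached so far.
    drops = sum(step >= m for m in milestones)
    return base_lr * gamma ** drops


# Hypothetical iteration count: about 104k images / batch size 8 gives
# 13,000 iterations per epoch, times 25 epochs.
ITERS_PER_EPOCH = 104_000 // 8
TOTAL_STEPS = ITERS_PER_EPOCH * 25

if __name__ == "__main__":
    for step in (0, 289_249, 289_250, 312_000, TOTAL_STEPS - 1):
        print(step, step_decay_lr(step, TOTAL_STEPS))
```

Under these assumptions the learning rate stays at 1e-4 for most of training and is reduced tenfold twice near the end. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.1` and the same milestones, stepped once per iteration, would produce a schedule of the same shape; whether the authors used it is not stated in the quoted text.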