Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Authors: Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data.
Researcher Affiliation	Academia	¹University of Illinois Urbana-Champaign, ²University of California Los Angeles EMAIL
Pseudocode	No	The paper describes the model architecture and training process in Section 4 and Appendix A.1, but does not include explicit pseudocode or algorithm blocks.
Open Source Code	Yes	The code and data are publicly available at: https://github.com/AnselBlume/partonomy
Open Datasets	Yes	We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. (...) The code and data are publicly available at: https://github.com/AnselBlume/partonomy
Dataset Splits	Yes	We take the training splits from each, divide them into training and evaluation sets by images in an 80/20 ratio, then generate at most one question of each type for each image.
Hardware Specification	No	This work used Delta at the National Center for Supercomputing Applications through allocation #250183 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program [4], which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
Software Dependencies	Yes	We use a pre-trained LMM, LLaVA-7B, and LLaVA-llama2-13B [28] as backbones for PLUM (Sec. 4). PLUM consists of a vision-language model (initialized from LLaVA [28]) which takes image and text inputs, along with a mask decoder (initialized from SAM's decoder [17]) that generates segmentation masks. (...) Training uses DeepSpeed ZeRO-2 with bf16 precision, a per-GPU batch of 6, and gradient_accumulation_steps=10 (effective batch 10 × bsz × N_GPU). Weights are updated by AdamW (β=(0.9, 0.95), no weight-decay) with a peak learning-rate of 3 × 10⁻⁴, linearly warmed up for the first 100 optimization steps and clipped to a global norm of 1.0 thereafter.
Experiment Setup	Yes	Table 6: Hyperparameters used for all experiments. We juxtapose four segmenting LMMs, including PLUM, against each other to illustrate the hyperparameter differences among the models. (...) Input resolution (px²) 1024² pixels and truncate text to 512 tokens. Training uses DeepSpeed ZeRO-2 with bf16 precision, a per-GPU batch of 6, and gradient_accumulation_steps=10 (effective batch 10 × bsz × N_GPU). Weights are updated by AdamW (β=(0.9, 0.95), no weight-decay) with a peak learning-rate of 3 × 10⁻⁴, linearly warmed up for the first 100 optimization steps and clipped to a global norm of 1.0 thereafter.
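
The Dataset Splits row describes an image-level 80/20 division followed by at most one question of each type per image. A minimal sketch of that procedure is given below, assuming a seeded random shuffle and a first-seen tie-break; neither is stated in the quoted text, and the function names and the (image_id, question_type, question) record layout are hypothetical, not taken from the PARTONOMY repository.

```python
import random


def split_by_image(image_ids, train_fraction=0.8, seed=0):
    """Split unique image ids into training and evaluation sets.

    Splitting by image (not by question) keeps every question about
    one image on the same side of the split. The seeded shuffle is an
    assumption; the source only states the 80/20 ratio.
    """
    ids = sorted(set(image_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def one_question_per_type(candidates):
    """Keep at most one question per (image_id, question_type) pair.

    `candidates` is an iterable of (image_id, question_type, question)
    tuples. Keeping the first one seen is an assumed tie-break.
    """
    kept = {}
    for image_id, question_type, question in candidates:
        kept.setdefault((image_id, question_type), question)
    return kept
```

Calling `split_by_image(range(100))` yields two disjoint lists of 80 and 20 image ids; which ids land on which side depends on the assumed seed.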
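
The training settings quoted in the Software Dependencies and Experiment Setup rows can be made concrete with a small dependency-free sketch. It shows the effective-batch arithmetic, a linear warmup to the peak learning rate, and global-norm gradient clipping. The GPU count is not reported in the quoted text, so the value in the usage note is purely illustrative; holding the rate constant after warmup is also an assumption, since the excerpt does not describe the post-warmup schedule. None of this is the authors' code.

```python
import math


def effective_batch_size(per_gpu_batch, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_gpu_batch * grad_accum_steps * num_gpus


def warmup_lr(step, peak_lr=3e-4, warmup_steps=100):
    """Learning rate at a 0-indexed optimization step.

    Rises linearly to peak_lr over the first warmup_steps steps.
    Staying at peak_lr afterwards is an assumption of this sketch.
    """
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * (step + 1) / warmup_steps


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

With the reported per-GPU batch of 6 and 10 accumulation steps, a hypothetical 8-GPU run would give `effective_batch_size(6, 10, 8) == 480`; the actual figure scales with however many GPUs were used.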