Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Seg4Diff: Unveiling Open-Vocabulary Semantic Segmentation in Text-to-Image Diffusion Transformers

Authors: Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity.
Researcher Affiliation	Academia	Chaehyun Kim1 Heeseong Shin1 Eunbeen Hong1 Heeji Yoon1 Anurag Arnab Paul Hongsuck Seo2 Sunghwan Hong3, Seungryong Kim1, 1KAIST AI 2Korea University 3ETH Zürich AI Center, CVG, PRS
Pseudocode	No	The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes methodologies using mathematical formulations and descriptive text, but no structured algorithm steps are presented.
Open Source Code	Yes	https://cvlab-kaist.github.io/Seg4Diff
Open Datasets	Yes	Training uses 10k images from either SA-1B [30] or COCO [34], with captions generated by CogVLM [61] following SD3's procedure [15]. We evaluate our method on two tasks: open-vocabulary semantic segmentation and unsupervised segmentation. For open-vocabulary semantic segmentation, we report mIoU on the validation sets of Pascal VOC [16], COCO-Object [34], Pascal Context-59 [16], and ADE20K [69], excluding the background class.
Dataset Splits	Yes	Training uses 10k images from either SA-1B [30] or COCO [34], with captions generated by CogVLM [61] following SD3's procedure [15]. For open-vocabulary semantic segmentation, we report mIoU on the validation sets of Pascal VOC [16], COCO-Object [34], Pascal Context-59 [16], and ADE20K [69], excluding the background class.
Hardware Specification	Yes	Training runs on two NVIDIA A6000 GPUs with per-device batch size 4 and gradient accumulation for an effective batch size of 16.
Software Dependencies	No	The paper mentions software like AdamW and LoRA modules but does not provide specific version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup	Yes	For zero-shot inference, the diffusion process is fixed at timestep t = 8 of 28 using the flow-matching Euler discrete scheduler. ... Images are processed at 1024 × 1024 resolution, and each transformer layer is equipped with a LoRA module of rank r = 16, trained using AdamW with lr = 1 × 10⁻⁵, default β parameters, and weight decay. Training runs on two NVIDIA A6000 GPUs with per-device batch size 4 and gradient accumulation for an effective batch size of 16. We used classifier-free guidance [23] with scale 7.5 for generation if not specified.
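
The batch-size figures quoted in the Hardware Specification and Experiment Setup rows can be checked with a small calculation. The sketch below is illustrative only: it assumes the usual relation effective batch size = number of GPUs × per-device batch size × gradient accumulation steps, and the names `BatchSetup` and `grad_accum_steps` are hypothetical, not taken from the paper or its code release. The accumulation step count itself is not stated in the quoted text; it is derived here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchSetup:
    """Batch-related values as quoted in the table (hypothetical container)."""
    num_gpus: int = 2                # "two NVIDIA A6000 GPUs"
    per_device_batch_size: int = 4   # "per-device batch size 4"
    effective_batch_size: int = 16   # "effective batch size of 16"


def grad_accum_steps(setup: BatchSetup) -> int:
    """Derive the accumulation steps implied by the quoted batch sizes.

    Assumes: effective = num_gpus * per_device_batch_size * accumulation_steps.
    """
    samples_per_step = setup.num_gpus * setup.per_device_batch_size
    if setup.effective_batch_size % samples_per_step != 0:
        raise ValueError("effective batch size is not a multiple of samples per step")
    return setup.effective_batch_size // samples_per_step


if __name__ == "__main__":
    # 16 / (2 * 4) = 2 accumulation steps under the stated assumption
    print(grad_accum_steps(BatchSetup()))
```

Under that assumption the quoted numbers imply two accumulation steps per optimizer update; a reproduction using a different GPU count would need to adjust this value to keep the effective batch size at 16.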