Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Seg4Diff: Unveiling Open-Vocabulary Semantic Segmentation in Text-to-Image Diffusion Transformers
Authors: Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. |
| Researcher Affiliation | Academia | Chaehyun Kim1 Heeseong Shin1 Eunbeen Hong1 Heeji Yoon1 Anurag Arnab Paul Hongsuck Seo2 Sunghwan Hong3, Seungryong Kim1, 1KAIST AI 2Korea University 3ETH Zürich AI Center, CVG, PRS |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes methodologies using mathematical formulations and descriptive text, but no structured algorithm steps are presented. |
| Open Source Code | Yes | https://cvlab-kaist.github.io/Seg4Diff |
| Open Datasets | Yes | Training uses 10k images from either SA-1B [30] or COCO [34], with captions generated by CogVLM [61] following SD3's procedure [15]. We evaluate our method on two tasks: open-vocabulary semantic segmentation and unsupervised segmentation. For open-vocabulary semantic segmentation, we report mIoU on the validation sets of Pascal VOC [16], COCO-Object [34], Pascal Context-59 [16], and ADE20K [69], excluding the background class. |
| Dataset Splits | Yes | Training uses 10k images from either SA-1B [30] or COCO [34], with captions generated by CogVLM [61] following SD3's procedure [15]. For open-vocabulary semantic segmentation, we report mIoU on the validation sets of Pascal VOC [16], COCO-Object [34], Pascal Context-59 [16], and ADE20K [69], excluding the background class. |
| Hardware Specification | Yes | Training runs on two NVIDIA A6000 GPUs with per-device batch size 4 and gradient accumulation for an effective batch size of 16. |
| Software Dependencies | No | The paper mentions software like AdamW and LoRA modules but does not provide specific version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | For zero-shot inference, the diffusion process is fixed at timestep t = 8 of 28 using the flow-matching Euler discrete scheduler. ... Images are processed at 1024 × 1024 resolution, and each transformer layer is equipped with a LoRA module of rank r = 16, trained using AdamW with lr = 1 × 10⁻⁵, default β parameters, and weight decay. Training runs on two NVIDIA A6000 GPUs with per-device batch size 4 and gradient accumulation for an effective batch size of 16. We used classifier-free guidance [23] with scale 7.5 for generation if not specified. |
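
The batch-size figures quoted in the Hardware Specification and Experiment Setup rows imply a gradient accumulation step count that the quoted text does not state outright. The following is a minimal sketch of that arithmetic, not the authors' code: the `TrainSetup` class and its field names are hypothetical, and only the numeric values (two GPUs, per-device batch size 4, effective batch size 16, LoRA rank 16, 1024 resolution) come from the quoted responses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Hypothetical container for the values quoted in the table above."""

    num_gpus: int = 2                 # "two NVIDIA A6000 GPUs"
    per_device_batch_size: int = 4    # "per-device batch size 4"
    effective_batch_size: int = 16    # "effective batch size of 16"
    lora_rank: int = 16               # "LoRA module of rank r = 16"
    resolution: int = 1024            # images processed at 1024 x 1024

    def grad_accum_steps(self) -> int:
        """Derive the accumulation steps needed to reach the effective batch size."""
        samples_per_step = self.num_gpus * self.per_device_batch_size
        if self.effective_batch_size % samples_per_step != 0:
            raise ValueError(
                "effective batch size must be a multiple of "
                "num_gpus * per_device_batch_size"
            )
        return self.effective_batch_size // samples_per_step


if __name__ == "__main__":
    setup = TrainSetup()
    # 16 / (2 * 4): the step count is inferred here, not reported in the quoted text.
    print("gradient accumulation steps:", setup.grad_accum_steps())
```

Under these assumptions the derived value is 2 accumulation steps per optimizer update; treat it as an inference from the quoted figures rather than a setting confirmed by the paper.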