Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Authors: Danfeng Li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities. Through comprehensive evaluations on both open-set (SACap-Eval) and closed-set (COCO-Stuff, ADE20K) benchmarks, Seg2Any consistently outperforms prior SOTA models, particularly in fine-grained spatial and attribute control of entities. 4 Experiments 4.1 Experiment Setup 4.2 Quantitative Results 4.3 Ablation Study
Researcher Affiliation Collaboration Danfeng Li1,2 Hui Zhang1,2 Sheng Wang3 Jiacheng Li3 Zuxuan Wu1,2 1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University 2Shanghai Collaborative Innovation Center of Intelligent Visual Computing 3HiThink Research
Pseudocode No The paper describes methods such as 'Semantic Alignment Attention Mask' and 'Sparse Shape Feature Adaptation' using textual descriptions and diagrams (e.g., Figure 3), but does not present any formal pseudocode blocks or algorithms.
Open Source Code No Our data and code are not open-source yet. However, we have provided sufficient details of our method and all the required hyperparameters in the main paper and supplementary material to enable the reproduction. We plan to open-source the data and code in the future.
Open Datasets Yes To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities. To address the lack of large-scale and fine-grained datasets for S2I generation, we construct SACap-1M, containing 1 million image-text pairs and 5.9 million segmented entities with detailed descriptions. Training datasets and Evaluation Benchmarks. Experiments are conducted on both open-set and closed-set segmentation datasets. For the open-set scenario, we utilize SACap-1M, which consists of 1 million images accompanied by 5.9 million regional captions. Evaluation for this setting is performed on SACap-Eval, a benchmark curated from a subset of SACap-1M... For closed-set scenario, we select two widely used datasets: ADE20K [59] and COCO-Stuff [3]. Following prior works [47, 24], for ADE20K and COCO-Stuff, regional captions are assigned as the semantic class names of each segment, and no global caption is provided.
Dataset Splits No For the open-set scenario, we utilize SACap-1M, which consists of 1 million images accompanied by 5.9 million regional captions. Evaluation for this setting is performed on SACap-Eval, a benchmark curated from a subset of SACap-1M, comprising 4,000 prompts with detailed entity descriptions and corresponding segmentation masks, with an average of 5.7 entities per image. For closed-set scenario, we select two widely used datasets: ADE20K [59] and COCO-Stuff [3]. Figure 13: The distribution of the number of segmentation masks per image across the training and test sets.
Hardware Specification Yes All experiments are conducted on 4 NVIDIA H100 GPUs. All experiments are conducted on a single NVIDIA H100 GPU with 32 sampling steps.
Software Dependencies No The paper mentions several models and frameworks like FLUX.1-dev, Qwen2-VL-72B [43], Mask2Former [6], DeepLabV3 [5], SAM2 [32], and Depth Anything V2 [49]. However, it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes The LoRA modules are applied to all linear layers of each block in DiT, with the LoRA rank set to 64, resulting in approximately 594M additional parameters. Across all datasets, our model is trained for 20,000 steps with a batch size of 16, using the AdamW optimizer and a fixed learning rate of 0.0001. The training resolution is set to 1024 × 1024 for the SACap-1M dataset and 512 × 512 for ADE20K and COCO-Stuff.
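
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. This is not the authors' code, which is not yet released; the names `TrainConfig`, `RESOLUTION`, `config_for`, and `lora_extra_params` are hypothetical, and only the numeric values come from the quoted text. The paper excerpt does not say whether the batch size of 16 is global or per GPU, so the sample count below assumes it is global.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    resolution: int                # square training resolution in pixels
    lora_rank: int = 64
    train_steps: int = 20_000
    batch_size: int = 16           # assumed to be the global batch size
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4    # fixed, no schedule mentioned


# Training resolution per dataset, as stated in the quoted text.
RESOLUTION = {"SACap-1M": 1024, "ADE20K": 512, "COCO-Stuff": 512}


def config_for(dataset: str) -> TrainConfig:
    """Build the config for one of the three datasets (KeyError otherwise)."""
    return TrainConfig(resolution=RESOLUTION[dataset])


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer.

    Standard LoRA adds two matrices, A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


sacap = config_for("SACap-1M")
ade = config_for("ADE20K")

# Images seen over the full run, under the global-batch assumption.
samples_seen = sacap.train_steps * sacap.batch_size
```

Summing `lora_extra_params` over every adapted linear layer would give the total adapter size. The layer dimensions needed to check the reported figure of roughly 594M are not given in this summary, so that total is not reproduced here.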