Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

Authors: Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. We compare the performance of specialized models, existing multi-task models, and our DICEPTION across various tasks. Specifically, we evaluate depth using the same protocol as GenPercept [120], normal estimation using the same method as StableNormal [129], interactive segmentation using the same approach as SAM [90], and human keypoints using the same method as Painter [115]. We also assess instance segmentation and entity segmentation on the MS COCO dataset.
Researcher Affiliation | Collaboration | Canyu Zhao (1), Yanlong Sun (2), Mingyu Liu (1), Huanyi Zheng (1), Muzhi Zhu (1), Zhiyue Zhao (1), Hao Chen (1), Tong He (1, 3), Chunhua Shen (1, 4); (1) Zhejiang University, (2) Tsinghua University, (3) Shanghai AI Laboratory, (4) Zhejiang University of Technology
Pseudocode | Yes | Algorithm 1 Keypoints Post-processing. Input: human pose RGB x, GT keypoints K_gt, RGB tolerance σ, distance threshold ξ. Output: extracted keypoints K_pred ... Algorithm 2 Segmentation Post-processing. Input: RGB segmentation mask m, RGB tolerance σ, area threshold ξ, kernel size k, connected components number threshold η, duplicate mask threshold β. Output: extracted masks M_pred
Open Source Code | No | Justification: We intend to further refine our model before releasing it as open source.
Open Datasets | Yes | Data. We randomly select 500k images from the Open Images [53] dataset and use Depth Pro [7] and StableNormal [129] to generate depth and normal annotations. For interactive segmentation, we randomly select 400k images from the SA-1B [51] dataset, as well as 200k images with fine-grained hair masks synthesized from the AM2k [58], AIM500 [59], and P3M-10k [57]. Entity segmentation data is from EntityV2 [84], while instance segmentation data comes from the COCO-ReM [97], and human pose data is sourced from COCO [64]. For few-shot fine-tuning, we select 50 samples from the Chest X-Ray dataset [114], LOL-v2 [127], and Kaggle's Brain Tumor dataset as training samples. More details can be found in Appendix A.
Dataset Splits | No | The paper mentions quantities of data used for training (e.g., "500k images from the Open Images [53] dataset", "400k images from the SA-1B [51] dataset") and for few-shot fine-tuning (e.g., "50 samples"), and lists several datasets used for validation. However, it does not explicitly state how its combined training data is split into training, validation, and test sets, with specific percentages or counts, for its main multi-task model.
Hardware Specification | Yes | Our training lasts for 24 days using 4 NVIDIA H800 GPUs. ... LoRA training is conducted on a single NVIDIA H100 GPU, with a constant learning rate of 2e-5 and a batch size of 8. The inference can be run on a GPU of 24GB memory with a batch size of 4.
Software Dependencies | No | The paper mentions using "AdamW optimizer" and "LoRA [44]" but does not specify version numbers for any software libraries, programming languages, or frameworks (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | Our training lasts for 24 days using 4 NVIDIA H800 GPUs. We employ the AdamW optimizer with a constant learning rate of 2e-5 and a batch size of 28 per GPU. ... Specifically, in each batch, depth and normal each account for 15%, interactive segmentation, entity segmentation, and instance segmentation each account for 20%, and pose estimation each account for 20%. ... During few-shot fine-tuning, we apply a rank-128 LoRA to all attention Q, K, and V layers in the network, which accounts for less than 1% of the total network parameters. ... LoRA training is conducted on a single NVIDIA H100 GPU, with a constant learning rate of 2e-5 and a batch size of 8. ... We perform 28 steps of denoising during inference which follows the settings of the pre-trained model SD3 [29]. The inference can be run on a GPU of 24GB memory with a batch size of 4. The classifier-free-guidance value is by default set to 2, more analysis in Appendix B.
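
The training figures quoted in the Experiment Setup row imply some simple arithmetic that can be checked directly. The sketch below is an illustration only, not the authors' code. It assumes plain data-parallel training with no gradient accumulation (the quoted text does not say either way), and the names `TrainConfig`, `global_batch_size`, and `normalized_mix` are hypothetical. The per-batch task shares are copied exactly as quoted; as quoted they add up to 110% rather than 100%, so the sketch also shows one way to normalize them.

```python
from dataclasses import dataclass, field


@dataclass
class TrainConfig:
    # Values copied from the quoted Experiment Setup row.
    num_gpus: int = 4
    batch_size_per_gpu: int = 28
    learning_rate: float = 2e-5
    denoising_steps: int = 28
    cfg_scale: float = 2.0
    # Per-batch task shares in percent, exactly as quoted.
    task_mix: dict = field(default_factory=lambda: {
        "depth": 15,
        "normal": 15,
        "interactive_seg": 20,
        "entity_seg": 20,
        "instance_seg": 20,
        "pose": 20,
    })


def global_batch_size(cfg):
    # Assumption: data parallelism without gradient accumulation.
    return cfg.num_gpus * cfg.batch_size_per_gpu


def normalized_mix(cfg):
    # Rescale the quoted shares so that they sum to 1.0.
    total = sum(cfg.task_mix.values())
    return {task: share / total for task, share in cfg.task_mix.items()}


cfg = TrainConfig()
print(global_batch_size(cfg))        # 112
print(sum(cfg.task_mix.values()))    # 110
print(normalized_mix(cfg)["depth"])  # share of depth samples after rescaling
```

Under these assumptions the effective batch is 112 samples per optimizer step. If the training actually used gradient accumulation, the effective batch would be larger by the accumulation factor.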