Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance
Authors: Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. |
| Researcher Affiliation | Collaboration | Jue Gong¹, Tingyu Yang¹, Jingkai Wang¹, Zheng Chen¹, Xing Liu², Hong Gu², Yulun Zhang¹, Xiaokang Yang¹; ¹Shanghai Jiao Tong University, ²vivo Mobile Communication Co., Ltd |
| Pseudocode | No | The paper provides detailed descriptions of the model structure and training objectives, including mathematical formulations and diagrams like Figure 3 showing the model structure, but it does not present a clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | Yes | Code is available at: https://github.com/gobunu/HAODiff. |
| Open Datasets | Yes | Our model is trained on the PERSONA [11], with 20k images sampled from both LSDIR [27] and FFHQ [18]. We randomly crop LSDIR to 512x512, and pre-downsample FFHQ to the same size. We generate synthetic HQ-LQ image pairs using our degradation pipeline. For testing, we use PERSONA-Val and PERSONA-Test from OSDHuman [11]. Additionally, we select images from the MPII Human Pose dataset [1] using the data selection pipeline from OSDHuman, excluding the quality filtering stage. |
| Dataset Splits | Yes | Our model is trained on the PERSONA [11], with 20k images sampled from both LSDIR [27] and FFHQ [18]. We randomly crop LSDIR to 512x512, and pre-downsample FFHQ to the same size. We generate synthetic HQ-LQ image pairs using our degradation pipeline. For testing, we use PERSONA-Val and PERSONA-Test from OSDHuman [11]. Additionally, we select images from the MPII Human Pose dataset [1] using the data selection pipeline from OSDHuman, excluding the quality filtering stage. To accommodate the bounding boxes used in this process, we extend the outermost annotated key points outward by a certain margin to form enclosing rectangles, which are used as bounding boxes. This process yields MPII-Test, which consists of 5,427 real-world images with diverse human motion blur (HMB). |
| Hardware Specification | Yes | [Stage 1] The training is conducted for 20k iterations on 4 NVIDIA RTX A6000 GPUs. [Stage 2] The training is conducted for 120k iterations on 2 NVIDIA RTX A6000 GPUs. Inference is performed on images of 512 × 512 resolution on NVIDIA RTX A6000. |
| Software Dependencies | Yes | The base model is SD2.1-base [43] and LoRA [16] is used to train the UNet with a LoRA rank 16. |
| Experiment Setup | Yes | In stage 1, the DPG training process balances the magnitude of the L1 loss and Dice loss by setting the α in Eq. (4) to 2 × 10⁻². We use the Adam optimizer [20] with learning rate 2 × 10⁻³ and batch size 16. The STLs in HE have 6 heads, while those in HRi use 3 heads. For HBR segmentation, the third branch outputs a single channel and utilizes the sigmoid activation function, while the other two output three channels without the activation function. The training is conducted for 20k iterations on 4 NVIDIA RTX A6000 GPUs. In stage 2, we set β in Eq. (10) to 1 × 10⁻² and use the pretrained SDXL [38] UNet as the discriminator following D3SR [26]. The λcfg in Eq. (8) is set to 3.5. The AdamW optimizer [34] used in stage 2 has learning rate 1 × 10⁻⁵ and batch size 2. The base model is SD2.1-base [43] and LoRA [16] is used to train the UNet with a LoRA rank 16. The training is conducted for 120k iterations on 2 NVIDIA RTX A6000 GPUs. |
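
The two-stage setup quoted in the Experiment Setup row can be collected into a small configuration sketch, which may help when comparing a reproduction attempt against the reported values. This is an illustrative summary only, not the authors' configuration format: the class name `StageConfig`, the constant names, and the helper `samples_seen` are hypothetical, and the helper assumes the reported batch size is the total across GPUs, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage, as quoted in the table.

    Hypothetical container; the released code may organise these differently.
    """
    optimizer: str
    learning_rate: float
    batch_size: int
    iterations: int
    num_gpus: int

    def samples_seen(self) -> int:
        # Assumption: batch_size is the total batch, not per-GPU.
        return self.batch_size * self.iterations


# Stage 1 (DPG training): Adam, lr 2e-3, batch 16, 20k iterations, 4 GPUs.
STAGE1 = StageConfig("Adam", 2e-3, 16, 20_000, 4)

# Stage 2 (one-step diffusion with LoRA): AdamW, lr 1e-5, batch 2,
# 120k iterations, 2 GPUs.
STAGE2 = StageConfig("AdamW", 1e-5, 2, 120_000, 2)

# Scalar settings quoted in the same row.
ALPHA_EQ4 = 2e-2    # balances L1 and Dice losses in Eq. (4)
BETA_EQ10 = 1e-2    # weight in Eq. (10)
LAMBDA_CFG = 3.5    # lambda_cfg in Eq. (8)
LORA_RANK = 16      # LoRA rank used for the UNet

if __name__ == "__main__":
    for name, cfg in (("stage 1", STAGE1), ("stage 2", STAGE2)):
        print(name, cfg, "samples:", cfg.samples_seen())
```

Under the total-batch assumption, stage 1 processes more samples than stage 2 despite running for fewer iterations, because its batch size is eight times larger.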