Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction
Authors: Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, Yuan Liu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that SyncHuman achieves robust and photorealistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models. |
| Researcher Affiliation | Academia | Wenyue Chen¹, Peng Li², Wangguandong Zheng³, Chengfeng Zhao², Mengfei Li², Yaolong Zhu¹, Zhiyang Dou⁴, Ronggang Wang¹, Yuan Liu²; ¹PKU, ²HKUST, ³SEU, ⁴MIT |
| Pseudocode | No | The paper describes algorithms and methods using mathematical equations and textual explanations, but no distinct pseudocode or algorithm blocks are present. |
| Open Source Code | No | Our evaluation uses the public datasets. We do not provide code in Supplementary Material. But they will be made publicly available once they have been fully prepared. |
| Open Datasets | Yes | Our models are trained on several widely used 3D human scanning datasets, including THuman2.1 [70], Custom Humans [12], THuman3.0 [51], and 2K2K [10]. To construct training images, we render 8 ground-truth images using orthographic cameras with evenly distributed azimuth angles and a fixed 0° elevation with a resolution of 768 × 768. For quantitative evaluation, we utilize 100 scans from X-Humans [47] and 150 scans from CAPE [32]. |
| Dataset Splits | Yes | For quantitative evaluation, we utilize 100 scans from X-Humans [47] and 150 scans from CAPE [32]. X-Humans contains 233 sequences of high-quality textured scans from 20 participants. We randomly selected 5 textured scans from each of the 20 participants in the X-Humans dataset, resulting in 100 test samples. Following ICON's partitioning criteria, we subdivide CAPE into "CAPE-FP" (50 samples) and "CAPE-NFP" (100 samples) to test the generalization ability in real-world examples. |
| Hardware Specification | Yes | Our 2D-3D Cross-Space generative model was trained on 8 NVIDIA H800 GPUs. For the multiview generative model branch, we adopt the architecture of PSHuman [25] but retrain it using flow matching from the open-source pre-trained text-to-image generation model, SD2.1-unclip [43]. ... Our Multiview Guided Decoder was trained on 1 NVIDIA H800 GPU. |
| Software Dependencies | Yes | Note that the multiview generative model is based on the Stable Diffusion 2.1 [44], and we retarget it to the same flow matching model as Trellis for jointly training. ... For the mesh decoder, we utilize Nvdiffrast [21] to render the extracted mesh along with its attributes... |
| Experiment Setup | Yes | A.1 Training Details: Our 2D-3D Cross-Space generative model was trained on 8 NVIDIA H800 GPUs. For the multiview generative model branch, we adopt the architecture of PSHuman [25] but retrain it using flow matching from the open-source pre-trained text-to-image generation model, SD2.1-unclip [43]. We train the multiview generation branch separately with a batch size of 32 for a total of 30,000 iterations. We adopt an adaptive learning rate schedule, initializing the learning rate at 1e-4 and decreasing it to 5e-5 after 2,000 steps. For the 2D-3D Cross-Space generative model, we initialize the network weights using the fine-tuned weights from our multiview generation branch (as described above) and a pre-trained image-to-3D model (Trellis [61]). Additionally, we perform zero-initialization on the output layer of the 2D-3D synchronization attention module. We train the 2D-3D Cross-Space generative model with a batch size of 32 for a total of 50,000 iterations. We adopt an adaptive learning rate schedule, initializing the learning rate at 2.5e-5 and decreasing it to 1.25e-5 after 2,000 steps. To enable classifier-free guidance (CFG) [14] during inference, we randomly omit the image condition at a rate of 0.05 during training. Our Multiview Guided Decoder was trained on 1 NVIDIA H800 GPU. We train the decoder with a batch size of 4 for a total of 14,000 iterations, using a learning rate of 1e-4. |
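
The hyperparameters quoted in the Experiment Setup and Open Datasets rows can be collected into a small Python sketch, which may help when cross-checking a reimplementation. This is not the authors' code. The names `StageConfig`, `STAGES`, `learning_rate`, `drop_image_condition`, and `render_azimuths` are hypothetical. The sketch assumes the learning rate switches in a single hard step at iteration 2,000 (the quoted text does not say whether the decrease is abrupt or gradual) and that the evenly distributed azimuths start at 0°.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters quoted in the table for one training stage."""
    name: str
    batch_size: int
    iterations: int
    lr_initial: float
    lr_final: float    # equals lr_initial when no decrease is reported
    lr_drop_step: int  # ASSUMPTION: hard switch at this step, not a gradual decay


# Values copied from the Experiment Setup row; stage names are illustrative.
STAGES = [
    StageConfig("multiview_branch", 32, 30_000, 1e-4, 5e-5, 2_000),
    StageConfig("cross_space_model", 32, 50_000, 2.5e-5, 1.25e-5, 2_000),
    StageConfig("multiview_guided_decoder", 4, 14_000, 1e-4, 1e-4, 0),
]


def learning_rate(stage: StageConfig, step: int) -> float:
    """Return the learning rate for a zero-based training step."""
    return stage.lr_initial if step < stage.lr_drop_step else stage.lr_final


def drop_image_condition(rng: random.Random, drop_rate: float = 0.05) -> bool:
    """Decide whether to omit the image condition for one training sample.

    Dropping the condition at a small rate during training is what lets
    classifier-free guidance mix conditional and unconditional predictions
    at inference time.
    """
    return rng.random() < drop_rate


def render_azimuths(num_views: int = 8) -> list[float]:
    """Evenly spaced azimuth angles in degrees (ASSUMPTION: starting at 0)."""
    return [i * 360.0 / num_views for i in range(num_views)]


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, learning_rate(stage, 0), learning_rate(stage, stage.iterations - 1))
    print(render_azimuths())
```

Keeping the three stages in one list makes it easy to compare a training script's settings against the reported values in a single loop.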