ShowMaker: Creating High-Fidelity 2D Human Video via Fine-Grained Diffusion Modeling
Authors: Quanwei Yang, Jiazhi Guan, Kaisiyuan Wang, Lingyun Yu, Wenqing Chu, Hang Zhou, ZhiQiang Feng, Haocheng Feng, Errui Ding, Jingdong Wang, Hongtao Xie
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive quantitative and qualitative experiments demonstrate the superior visual quality and temporal consistency of our method. |
| Researcher Affiliation | Collaboration | University of Science and Technology of China; Tsinghua University; Department of Computer Vision Technology (VIS), Baidu Inc. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | We will provide the code in the camera-ready version. |
| Open Datasets | Yes | In order to verify the effectiveness of the proposed method, we select the videos of two actors, Seth and Oliver, in the TalkSHOW [51] dataset for training and testing. In addition, to enrich the diversity of characters and hand movements, we record videos of seven people in an indoor scene. [...] We randomly divide the training set and test set according to 9:1. (A minimal split sketch follows the table.) |
| Dataset Splits | No | The paper states a 9:1 train/test split but does not mention a separate validation split. |
| Hardware Specification | Yes | All experiments were completed on 8 A800s, with a learning rate of 1e-5. |
| Software Dependencies | No | The paper mentions the use of several pre-trained models and tools (e.g., DWPose, SD, CLIP, ArcFace, AnimateDiff) but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For data processing, we crop out the body region at a resolution of 512×512 and utilize the pre-trained face detection model [2] to crop and align faces following FFHQ [20]. The resolution of the face image is 256×256. The hand and face masks are determined based on the largest circumscribed rectangle of the corresponding key points. All experiments were completed on 8 A800s, with a learning rate of 1e-5. For the first training stage, the batch size B is set to 24 and the sequence length F is set to 1; training runs for 100k steps, taking about six days. For the second stage, B and F are set to 1 and 24, respectively, with 30k training steps, taking about one day. During inference, we adopt a CFG [10] scale of 7.5 and perform 30 denoising steps using the DDIM sampler. (Sketches of the training schedule and sampler settings follow the table.) |
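To make the reported 9:1 split concrete, here is a minimal sketch of a random clip-level train/test split. The directory layout, file extension, and random seed are assumptions of mine; the paper does not state how the shuffle was seeded or at what granularity the split was drawn.

```python
# Hypothetical 9:1 random train/test split at the clip level.
# "data/clips", "*.mp4", and seed 0 are illustrative assumptions,
# not details from the paper.
import random
from pathlib import Path

clips = sorted(Path("data/clips").glob("*.mp4"))
random.seed(0)
random.shuffle(clips)

n_train = int(0.9 * len(clips))  # 9:1 ratio reported in the paper
train_set, test_set = clips[:n_train], clips[n_train:]
print(f"{len(train_set)} train clips, {len(test_set)} test clips")
```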
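Likewise, the two-stage training schedule and the inference settings can be written down as a short sketch. The `StageConfig` layout is my own, and the DDIM/CFG call below uses a public Stable Diffusion pipeline from diffusers as a stand-in, since the authors' pose-conditioned video model is not released; the model id and prompt are placeholders.

```python
# Two-stage hyperparameters as reported (lr 1e-5, trained on 8 A800 GPUs);
# the StageConfig structure itself is illustrative, not from the paper.
from dataclasses import dataclass

import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

@dataclass
class StageConfig:
    batch_size: int   # B
    seq_length: int   # F, frames per training sample
    train_steps: int
    learning_rate: float = 1e-5

STAGE1 = StageConfig(batch_size=24, seq_length=1, train_steps=100_000)  # ~6 days
STAGE2 = StageConfig(batch_size=1, seq_length=24, train_steps=30_000)   # ~1 day

# Inference settings (CFG scale 7.5, 30 DDIM steps) demonstrated on a
# generic SD pipeline standing in for the unreleased ShowMaker model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a person delivering a talk-show monologue",  # placeholder conditioning
    num_inference_steps=30,  # denoising steps from the paper
    guidance_scale=7.5,      # CFG scale from the paper
).images[0]
```

Swapping the default scheduler for `DDIMScheduler` mirrors the paper's stated choice of the DDIM sampler; with only 30 steps, a deterministic sampler like DDIM is a common trade-off between speed and fidelity.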