ShowMaker: Creating High-Fidelity 2D Human Video via Fine-Grained Diffusion Modeling
Authors: Quanwei Yang, Jiazhi Guan, Kaisiyuan Wang, Lingyun Yu, Wenqing Chu, Hang Zhou, ZhiQiang Feng, Haocheng Feng, Errui Ding, Jingdong Wang, Hongtao Xie
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive quantitative and qualitative experiments demonstrate the superior visual quality and temporal consistency of our method. |
| Researcher Affiliation | Collaboration | University of Science and Technology of China; Tsinghua University; Department of Computer Vision Technology (VIS), Baidu Inc. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | We will provide the code in the camera-ready version. |
| Open Datasets | Yes | In order to verify the effectiveness of the proposed method, we select the videos of two actors, Seth and Oliver, in the TalkSHOW [51] dataset for training and testing. In addition, to enrich the diversity of characters and hand movements, we record videos of seven people in an indoor scene. [...] We randomly divide the training set and test set according to 9:1. (A minimal split sketch follows the table.) |
| Dataset Splits | No | The paper states a 9:1 train/test split but does not mention a separate validation split. |
| Hardware Specification | Yes | All experiments were completed on 8 A800s, with a learning rate of 1e-5. |
| Software Dependencies | No | The paper mentions the use of several pre-trained models and tools (e.g., DWPose, SD, CLIP, ArcFace, AnimateDiff) but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For data processing, we crop out the body region at a resolution of 512×512 and utilize the pre-trained face detection model [2] to crop and align faces following FFHQ [20]. The resolution of the face image is 256×256. The hand and face masks are determined based on the largest circumscribed rectangle of the corresponding key points. All experiments were completed on 8 A800s, with a learning rate of 1e-5. For the first training stage, the batch size B is set to 24 and the sequence length F is set to 1; training runs for 100k steps, taking about six days. For the second stage, B and F are set to 1 and 24, respectively, with 30k training steps, taking about one day. During inference, we adopt a CFG [10] scale of 7.5 and perform 30 denoising steps using the DDIM sampler. (Sketches of the training schedule and sampler settings follow the table.) |
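To make the reported 9:1 split concrete, here is a minimal sketch of a random clip-level train/test split. The directory layout, file extension, and random seed are assumptions of mine; the paper does not state how the shuffle was seeded or at what granularity the split was drawn.

```python
# Hypothetical 9:1 random train/test split at the clip level.
# "data/clips", "*.mp4", and seed 0 are illustrative assumptions,
# not details from the paper.
import random
from pathlib import Path

clips = sorted(Path("data/clips").glob("*.mp4"))
random.seed(0)
random.shuffle(clips)

n_train = int(0.9 * len(clips))  # 9:1 ratio reported in the paper
train_set, test_set = clips[:n_train], clips[n_train:]
print(f"{len(train_set)} train clips, {len(test_set)} test clips")
```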
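Likewise, the two-stage training schedule and the inference settings can be written down as a short sketch. The `StageConfig` layout is my own, and the DDIM/CFG call below uses a public Stable Diffusion pipeline from diffusers as a stand-in, since the authors' pose-conditioned video model is not released; the model id and prompt are placeholders.

```python
# Two-stage hyperparameters as reported (lr 1e-5, trained on 8 A800 GPUs);
# the StageConfig structure itself is illustrative, not from the paper.
from dataclasses import dataclass

import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

@dataclass
class StageConfig:
    batch_size: int   # B
    seq_length: int   # F, frames per training sample
    train_steps: int
    learning_rate: float = 1e-5

STAGE1 = StageConfig(batch_size=24, seq_length=1, train_steps=100_000)  # ~6 days
STAGE2 = StageConfig(batch_size=1, seq_length=24, train_steps=30_000)   # ~1 day

# Inference settings (CFG scale 7.5, 30 DDIM steps) demonstrated on a
# generic SD pipeline standing in for the unreleased ShowMaker model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a person delivering a talk-show monologue",  # placeholder conditioning
    num_inference_steps=30,  # denoising steps from the paper
    guidance_scale=7.5,      # CFG scale from the paper
).images[0]
```

Swapping the default scheduler for `DDIMScheduler` mirrors the paper's stated choice of the DDIM sampler; with only 30 steps, a deterministic sampler like DDIM is a common trade-off between speed and fidelity.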