Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
Authors: Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in terms of video quality, but also exhibits strong competitiveness in lip synchronization while offering improved flexibility in controlling emotion and head pose. (from abstract) and 4. Experiments (section title). |
| Researcher Affiliation | Industry | Guangzhou Quwan Network Technology. Correspondence to: Jiaran Cai <EMAIL>, Xingpei Ma <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in detail, including equations for transformations and loss functions, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be available at https://github.com/Playmate111/Playmate. |
| Open Datasets | Yes | Datasets. We utilize a mixture of datasets, including AVSpeech (Ephrat et al., 2018), CelebV-Text (Yu et al., 2023), Acappella (Montesinos et al., 2021), MEAD (Wang et al., 2020), MAFW (Liu et al., 2022), and a talking video dataset collected by us to train our Playmate. |
| Dataset Splits | Yes | In the first stage, we selected approximately 80,000 video clips from the AVSpeech, CelebV-Text, Acappella, and our own dataset to train the diffusion transformer. For the second phase, we selected approximately 30,000 emotionally labeled video clips from the MEAD, MAFW, and our own dataset to train the emotion control module. The duration of each training video ranges from 3 to 30 seconds. We set aside a portion of video clips from our dataset that were not involved in the training process as our out-of-distribution test set. |
| Hardware Specification | Yes | The first training phase utilized four NVIDIA A100 GPUs over a 3-day period, with models initialized from scratch. In the second phase, we continued training for two days with two NVIDIA A100 GPUs, while freezing the parameters of the diffusion transformer. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and models such as ArcFace and an emotion classifier, but it does not provide version numbers for any key software dependency, such as the programming language or deep learning framework. |
| Experiment Setup | Yes | In our experiments, the videos are initially converted to 25 fps and subsequently cropped to a resolution of 256 × 256 pixels... The final output resolution is set to 512 × 512 pixels. During preprocessing, the audios were resampled to 16 kHz... For all experiments, we employed the Adam optimizer (Kingma, 2014). In the inference phase, multi-condition CFG is performed. The CFG scales of the audio condition w_a and the emotion condition w_e are set to 1.5. |
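The Experiment Setup row reports that multi-condition CFG is performed at inference with guidance scales of 1.5 for both the audio condition w_a and the emotion condition w_e. The paper's exact combination rule is not quoted above, so the sketch below shows one common multi-condition classifier-free guidance formulation (chaining guidance terms condition by condition); the function and argument names are illustrative, not taken from the paper or its code release.

```python
import numpy as np

def multi_condition_cfg(eps_uncond, eps_audio, eps_full, w_a=1.5, w_e=1.5):
    """Combine denoiser outputs under a common multi-condition CFG rule.

    eps_uncond: noise prediction with all conditions dropped
    eps_audio:  noise prediction with only the audio condition
    eps_full:   noise prediction with audio and emotion conditions
    w_a, w_e:   guidance scales (both 1.5 in the quoted setup)
    """
    return (eps_uncond
            + w_a * (eps_audio - eps_uncond)
            + w_e * (eps_full - eps_audio))

# Toy example on random "noise predictions" of shape (4,)
rng = np.random.default_rng(0)
eps_u, eps_a, eps_f = rng.standard_normal((3, 4))
guided = multi_condition_cfg(eps_u, eps_a, eps_f)
```

With both scales set to 1.0 this rule reduces to the fully conditional prediction; scales above 1.0 (as here) push the sample more strongly toward each condition.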