Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
Authors: Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in terms of video quality, but also exhibits strong competitiveness in lip synchronization while offering improved flexibility in controlling emotion and head pose. (from abstract) and 4. Experiments (section title). |
| Researcher Affiliation | Industry | Guangzhou Quwan Network Technology. Correspondence to: Jiaran Cai <EMAIL>, Xingpei Ma <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in detail, including equations for transformations and loss functions, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be available at https://github.com/Playmate111/Playmate. |
| Open Datasets | Yes | Datasets. We utilize a mixture of datasets, including AVSpeech (Ephrat et al., 2018), CelebV-Text (Yu et al., 2023), Acappella (Montesinos et al., 2021), MEAD (Wang et al., 2020), MAFW (Liu et al., 2022), and a talking video dataset collected by us to train our Playmate. |
| Dataset Splits | Yes | In the first stage, we selected approximately 80,000 video clips from the AVSpeech, CelebV-Text, Acappella, and our own dataset to train the diffusion transformer. For the second phase, we selected approximately 30,000 emotionally labeled video clips from the MEAD, MAFW, and our own dataset to train the emotion control module. The duration of each training video ranges from 3 to 30 seconds. We set aside a portion of video clips from our dataset that were not involved in the training process as our out-of-distribution test set. |
| Hardware Specification | Yes | The first training phase utilized four NVIDIA A100 GPUs over a 3-day period, with models initialized from scratch. In the second phase, we continued training for two days with two NVIDIA A100 GPUs, while freezing the parameters of the diffusion transformer. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and models such as ArcFace and an emotion classifier, but it does not provide version numbers for any key software dependency, such as the programming language or deep learning framework. |
| Experiment Setup | Yes | In our experiments, the videos are initially converted to 25 fps and subsequently cropped to a resolution of 256 × 256 pixels... The final output resolution is set to 512 × 512 pixels. During preprocessing, the audios were resampled to 16 kHz... For all experiments, we employed the Adam optimizer (Kingma, 2014). In the inference phase, multi-condition CFG is performed. The CFG scales of the audio condition w_a and the emotion condition w_e are set to 1.5. |
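The Experiment Setup row reports that multi-condition CFG is performed at inference with guidance scales of 1.5 for both the audio condition w_a and the emotion condition w_e. The paper's exact combination rule is not quoted above, so the sketch below shows one common multi-condition classifier-free guidance formulation (chaining guidance terms condition by condition); the function and argument names are illustrative, not taken from the paper or its code release.

```python
import numpy as np

def multi_condition_cfg(eps_uncond, eps_audio, eps_full, w_a=1.5, w_e=1.5):
    """Combine denoiser outputs under a common multi-condition CFG rule.

    eps_uncond: noise prediction with all conditions dropped
    eps_audio:  noise prediction with only the audio condition
    eps_full:   noise prediction with audio and emotion conditions
    w_a, w_e:   guidance scales (both 1.5 in the quoted setup)
    """
    return (eps_uncond
            + w_a * (eps_audio - eps_uncond)
            + w_e * (eps_full - eps_audio))

# Toy example on random "noise predictions" of shape (4,)
rng = np.random.default_rng(0)
eps_u, eps_a, eps_f = rng.standard_normal((3, 4))
guided = multi_condition_cfg(eps_u, eps_a, eps_f)
```

With both scales set to 1.0 this rule reduces to the fully conditional prediction; scales above 1.0 (as here) push the sample more strongly toward each condition.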