Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Emotional Face-to-Speech

Authors: Jiaxin Ye, Boyuan Cao, Hongming Shan

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech compared to baselines, even surpassing speech-driven methods. ... Extensive experimental results demonstrate that DEmoFace can generate more consistent, natural speech with enhanced emotions compared to previous methods. ... Section 5. Experimental Results, including Quantitative Evaluation, Subjective Evaluation, Qualitative Results, and Ablation Studies.
Researcher Affiliation | Academia | Institute of Science and Technology for Brain-Inspired Intelligence, MOE Frontiers Center for Brain Science, Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and State Key Laboratory of Brain Function and Disorders, Fudan University, Shanghai, China. Correspondence to: Hongming Shan <EMAIL>.
Pseudocode | No | The paper describes the methodology in text and with figures, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Demos of DEmoFace are shown at our project https://demoface.github.io. This is a project demonstration page, not an explicit statement of code release or a direct link to a code repository.
Open Datasets | Yes | All our models are pre-trained on three datasets with pairs of face video and speech: RAVDESS (Livingstone & Russo, 2018), MEAD (Wang et al., 2020; Gan et al., 2023), and MELD-FAIR (Carneiro et al., 2023). ... We incorporate a 10-hour subset from LRS3 (Afouras et al., 2018) for pre-training... multiple large-scale TTS datasets (such as LRS3 (Afouras et al., 2018), VoxCeleb2 (Chung et al., 2018), and LJSpeech (Ito & Johnson, 2017), etc.).
Dataset Splits | Yes | The RAVDESS and MEAD portions of the combined dataset are randomly segmented into training, validation, and test sets without any speaker overlap. For MELD-FAIR, we follow the original splits.
Hardware Specification | Yes | We train the model using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1e-4, batch size 32, and a 24GB NVIDIA RTX 4090 GPU. ... We train our identity encoder achieving face-speech alignment on a 24GB NVIDIA RTX 4090 GPU...
Software Dependencies | No | The paper mentions various models (Whisper, SepFormer, HiFi-GAN via a footnote link) and optimizers (AdamW, Adam), but it does not specify concrete version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | We train the model using the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.999, a learning rate of 1e-4, batch size 32... The total number of iterations is 300k. During inference, we use the Euler sampler with 96 steps... We set the joint guidance scale w0 = 1.9, and compositional scales w1 = w2 = 1.0, w3 = 1.6.
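The speaker-disjoint split reported under Dataset Splits can be sketched as follows. This is a minimal illustration: the function name, the (speaker_id, clip) sample representation, and the 80/10/10 ratios are assumptions for the sketch, not details taken from the paper, which only states that RAVDESS and MEAD are randomly segmented without speaker overlap.

```python
import random

def speaker_disjoint_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (speaker_id, clip) pairs into train/val/test sets such that
    no speaker appears in more than one split.

    Splitting is done at the speaker level: speakers are shuffled and
    partitioned by the given ratios, then every clip follows its speaker.
    """
    speakers = sorted({spk for spk, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)

    n = len(speakers)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    groups = {
        "train": set(speakers[:n_train]),
        "val": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }
    # Assign each clip to the split that owns its speaker.
    return {
        name: [(spk, clip) for spk, clip in samples if spk in spks]
        for name, spks in groups.items()
    }
```

Because the partition is over speakers rather than clips, the resulting splits are disjoint by construction, which is the property the paper's evaluation relies on.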
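The inference setup quoted above (a 96-step Euler sampler with a joint guidance scale w0 and compositional scales w1, w2, w3) can be sketched with a generic fixed-step Euler integrator and one common form of compositional classifier-free guidance. Both functions here are illustrative assumptions: the paper does not give its exact guidance combination, and the constant toy velocity field in the usage example stands in for the learned model.

```python
def guided_velocity(v_uncond, v_conds, scales):
    """One common compositional classifier-free-guidance combination:
    v = v_uncond + sum_i w_i * (v_cond_i - v_uncond).
    The paper's exact formulation (including its joint scale w0) may differ.
    """
    v = v_uncond
    for v_c, w in zip(v_conds, scales):
        v = v + w * (v_c - v_uncond)
    return v

def euler_sample(x0, velocity_fn, steps=96):
    """Fixed-step Euler integration of dx/dt = v(x, t) from t=0 to t=1,
    matching the 96-step Euler sampler mentioned in the setup."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + velocity_fn(x, t) * dt
    return x
```

For example, integrating the constant velocity field v(x, t) = 1.0 from x0 = 0 moves the sample to approximately 1.0 after the full 96 steps, since the step sizes sum to the unit time interval.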