Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MoCha: Towards Movie-Grade Talking Character Generation

Authors: Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive qualitative and quantitative evaluations, including human evaluation studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, controllability and generalization.
Researcher Affiliation | Collaboration | Cong Wei¹,², Bo Sun², Haoyu Ma², Ji Hou², Felix Juefei-Xu², Zecheng He², Xiaoliang Dai², Luxin Zhang², Kunpeng Li², Tingbo Hou², Animesh Sinha², Peter Vajda², Wenhu Chen¹; ¹University of Waterloo, ²Gen AI, Meta
Pseudocode | No | The paper describes the model architecture and training strategy in detail across sections 3 and 4, but it does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | We publicly release the evaluation benchmark, but the code is not yet available.
Open Datasets | Yes | We introduce MoCha-bench, a benchmark tailored for the Talking Character generation task. It contains 200 diverse samples, each comprising a text prompt and corresponding audio clip. The dataset spans various camera shot angles and camera movements; for example, close-up shots emphasize facial expressions and lip-sync, while medium shots highlight hand gestures and body movement. Scenes cover a wide range of human activities and object interactions (e.g., woman holding a coffee cup, professor talking to student), with characters speaking with various emotions and facing directions. All prompts were manually curated and further enriched using the publicly released LLaMA-3 [31] model to enhance expressiveness and diversity.
Dataset Splits | No | The paper mentions that 'MoCha-bench' contains 200 diverse samples, but it does not specify any training, validation, or test splits for this benchmark. For the training data, it discusses 'Mixed-Modal Sampling' (80% Multimodal, 20% Unimodal) and a 'Shot-Type-Based Curriculum', but these are sampling strategies applied during training rather than explicit train/validation/test splits for evaluation or reproducibility purposes.
Hardware Specification | Yes | Training is distributed across 64 compute nodes.
Software Dependencies | No | Our audio pipeline is powered by Wav2Vec2, but instead of relying solely on its final output, we extract and stitch together the embeddings from all 12 internal layers. This approach gives us a deeper, layered view of the audio content, with each layer contributing a 768-dimensional slice to the overall representation. After running the audio through Wav2Vec2's tokenizer, we stretch or compress the resulting sequence using linear interpolation before it is fed into Wav2Vec2 itself, so that we end up with the same number of audio features as there are video frames, effectively assigning a unique audio token to each frame. To provide each frame's audio token with extra context, we expand its feature vector by gluing on the tokens from the five frames before and after it. So for any given frame f, the final embedding is built as A(f) = [A(f − 5), ..., A(f), ..., A(f + 5)]. This chunky, context-aware audio feature then passes through a straightforward two-layer neural net (an MLP with a hidden size of 512), which reshapes it into the 6144-dimensional token α(f) needed by our model's backbone.
Experiment Setup | Yes | We used a constant learning rate scheduler with 2000 warm-up steps. ... Throughout training, all input examples are resized to a resolution of approximately 720 px, preserving their original aspect ratios. ... We build MoCha based on the Movie Gen backbone. The Stage 0 training is included in the Movie Gen backbone pretraining. Then we add the speech cross-attention and speech projector to the Movie Gen backbone to build MoCha. During Stages 1-N training, we fully fine-tune the entire 30B MoCha model while freezing the text encoder, speech encoder, and text projector.
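
Illustrative sketch (not from the paper): the audio conditioning quoted in the Software Dependencies row can be approximated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, the Wav2Vec2 model is not run (a random array stands in for its features, and interpolation is applied to that stand-in directly), clip boundaries are handled by repeating the first and last frame, the MLP uses a ReLU activation with no bias terms, and the weights are random. Only the layer count, the widths, and the five-frame context come from the quoted text.

```python
import numpy as np

# Values taken from the quoted description.
NUM_LAYERS = 12    # Wav2Vec2 internal layers whose embeddings are stitched together
LAYER_DIM = 768    # width contributed by each layer
CONTEXT = 5        # frames of context on each side of frame f
HIDDEN_DIM = 512   # MLP hidden size
TOKEN_DIM = 6144   # width of the token alpha(f) expected by the backbone


def resample_to_frames(seq, num_frames):
    """Linearly interpolate a (T, C) sequence along time to (num_frames, C)."""
    t = np.linspace(0.0, seq.shape[0] - 1, num_frames)
    lo = np.floor(t).astype(int)
    hi = np.minimum(lo + 1, seq.shape[0] - 1)
    w = (t - lo)[:, None]
    return (1.0 - w) * seq[lo] + w * seq[hi]


def add_context(feats, context=CONTEXT):
    """Build [A(f - context), ..., A(f), ..., A(f + context)] for every frame f.

    Assumption: frames near the clip boundaries reuse the first/last frame.
    """
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)


def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random stand-in weights for the two-layer projector."""
    w1 = rng.standard_normal((in_dim, hidden_dim), dtype=np.float32) * 0.01
    w2 = rng.standard_normal((hidden_dim, out_dim), dtype=np.float32) * 0.01
    return w1, w2


def mlp(x, w1, w2):
    """Two-layer MLP; the ReLU activation is an assumption."""
    return np.maximum(x @ w1, 0.0) @ w2


def audio_tokens(raw_feats, num_frames, w1, w2, context=CONTEXT):
    """Stand-in features -> one projected audio token per video frame."""
    per_frame = resample_to_frames(raw_feats, num_frames)
    windowed = add_context(per_frame, context)
    return mlp(windowed, w1, w2)
```

With the quoted sizes, each frame carries 12 × 768 = 9216 features, the 11-frame window widens that to 101 376, and the MLP maps it through 512 hidden units to a 6144-dimensional token. The functions take the widths as arguments, so they can be tried with much smaller arrays before allocating full-size weights.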