Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Metis: A Foundation Speech Generation Model with Masked Generative Pre-training
Authors: Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, Zhizheng Wu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. |
| Researcher Affiliation | Academia | Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, Zhizheng Wu The Chinese University of Hong Kong, Shenzhen EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods in prose and equations (e.g., Section 3.1 Background: Masked Generative Models) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Audio samples are available at https://metis-demo.github.io/. We release the code and model checkpoints at https://github.com/open-mmlab/Amphion. |
| Open Datasets | Yes | Dataset: We use a dataset consisting of 300K hours of speech for pre-training, including 100K hours from the Emilia [57] dataset and an additional 200K hours self-collected through the Emilia pipeline, to train our pre-trained model. ... We evaluate our zero-shot TTS models using three test sets: 1) SeedTTS test-en, a test set introduced in Seed-TTS [31] comprising 1,000 samples extracted from English public corpora, including the Common Voice dataset [63]. 2) SeedTTS test-zh, a test set introduced in Seed-TTS consisting of 2,000 samples extracted from Chinese public corpora, including the DiDiSpeech dataset [64]. 3) LibriSpeech test-clean [65], a widely used test set for TTS. |
| Dataset Splits | Yes | We evaluate our zero-shot TTS models using three test sets: 1) SeedTTS test-en, a test set introduced in Seed-TTS [31] comprising 1,000 samples extracted from English public corpora, including the Common Voice dataset [63]. 2) SeedTTS test-zh, a test set introduced in Seed-TTS consisting of 2,000 samples extracted from Chinese public corpora, including the DiDiSpeech dataset [64]. 3) LibriSpeech test-clean [65], a widely used test set for TTS. ... For target speaker extraction, we randomly sample 10K hours of data from the pre-training dataset to create the fine-tuning training set without any filtering. ... A subset of 10K hours of data is randomly sampled from the Emilia dataset as the shared training data for these tasks. |
| Hardware Specification | Yes | Training: We pre-train our model on 8 GPUs for a total of 1200K steps. ... our pre-trained model converges on a single A100 GPU after just 10K steps of LoRA fine-tuning and 5K steps of full-scale fine-tuning, using randomly sampled 0.4K hours of training data. |
| Software Dependencies | No | The paper mentions specific models used for ASR (whisper-large-v3, paraformer-zh) and a specific optimizer (AdamW) but does not provide version numbers for general software dependencies such as programming languages or core deep learning frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Training: We pre-train our model on 8 GPUs for a total of 1200K steps. We use the AdamW [58] optimizer with a learning rate of 1e-4 and 32K warmup steps. We employ a dynamic batch size, where each batch contains 10K tokens (200 seconds) per GPU. During training, we randomly select a prefix of the sequence as a prompt that is not masked with a probability p = 0.8. The length of the prompt is uniformly sampled from the range [0%, 40%] of the total sequence length. |
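
The prompt-prefix rule quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_prompt_length` and `maskable_positions` are hypothetical, and the sketch assumes that "with a probability p = 0.8" means a prompt is used in 80% of training samples, with no prompt otherwise.

```python
import random


def sample_prompt_length(seq_len, p_prompt=0.8, max_frac=0.4, rng=None):
    """Return how many leading tokens are kept unmasked as a prompt.

    Assumption: with probability p_prompt a prefix is used as a prompt,
    and its length is drawn uniformly from [0%, 40%] of the sequence.
    Otherwise no prompt is used and the function returns 0.
    """
    rng = rng or random
    if rng.random() >= p_prompt:
        return 0  # no prompt for this training sample
    # Uniform fraction in [0, max_frac], rounded down to whole tokens.
    return int(rng.uniform(0.0, max_frac) * seq_len)


def maskable_positions(seq_len, prompt_len):
    """True marks positions that may be masked; the prompt prefix is False."""
    return [i >= prompt_len for i in range(seq_len)]
```

A possible usage: call `sample_prompt_length(len(tokens))` once per training sequence, then restrict the masking step to the positions flagged `True` by `maskable_positions`. How the remaining positions are masked is not covered by the quoted setup and is left out of the sketch.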