Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Authors: Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, Zhizheng Wu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Our experiments demonstrate that MaskGCT has achieved performance comparable to or superior to that of existing models in terms of speech quality, similarity, prosody, and intelligibility. |
| Researcher Affiliation | Collaboration | Yuancheng Wang1, Haoyue Zhan2, Liwei Liu1, Ruihong Zeng2, Haotian Guo1, Jiachen Zheng1, Qiang Zhang2, Shunsi Zhang2, Xueyao Zhang1, Zhizheng Wu1 1The Chinese University of Hong Kong, Shenzhen 2Guangzhou Quwan Network Technology EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and processes in detail, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | We release our code and model checkpoints at https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct. |
| Open Datasets | Yes | We use the Emilia [48] dataset to train our models. ... We evaluate our zero-shot TTS models with three benchmarks: (1) LibriSpeech [49] test-clean... (2) Seed-TTS test-en... (3) Seed-TTS test-zh... For accent imitation, we randomly sampled a portion of data from the L2-ARCTIC [60] accent corpus and the ESD [61] emotion corpus... |
| Dataset Splits | Yes | For the construction of the train and test datasets, we selected one male and one female speaker each from native English and native Mandarin backgrounds, resulting in a total of four speakers for the test dataset. The remaining 16 speakers were allocated to the training dataset. For the 350 parallel Chinese utterances, we randomly chose 22 utterances for the test set, with the remaining utterances designated for training. Similarly, for the 350 parallel English utterances, we randomly selected 21 utterances for the test set, with the rest used for training. |
| Hardware Specification | Yes | We train all models on 8 NVIDIA A100 80GB GPUs. ... Table 9: Real-time factor (RTF) comparison of MaskGCT and AR + SoundStorm on an A100 GPU for generating a 20-second speech. |
| Software Dependencies | No | The paper mentions several models, frameworks, and tools like "HuBERT-based ASR model", "Whisper-large-v3", "Paraformer-zh", "Llama-style Transformer", "phonemize", "jieba", and "pypinyin", but it does not provide specific version numbers for these software components or libraries, which is required for reproducibility. |
| Experiment Setup | Yes | We optimize these models with the AdamW [57] optimizer with a learning rate of 1e-4 and 32K warmup steps, following the inverse square root learning schedule. ... For the T2S model, we use 50 steps as the default total inference steps. The classifier-free guidance scale and the classifier-free guidance rescale factor [59] are set to 2.5 and 0.75, respectively. ... For sampling, we use a top-k of 20, with the sampling temperature annealing from 1.5 to 0. ... For the S2A model, we use [40, 16, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] steps for acoustic RVQ layers by default... |
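The optimizer settings quoted above (learning rate 1e-4, 32K warmup steps, inverse square root schedule) can be sketched as follows. This is a generic reconstruction of the standard warmup-then-inverse-square-root schedule, not code from the MaskGCT repository; the function name and the linear-warmup choice are assumptions.

```python
import math

def inv_sqrt_lr(step: int, base_lr: float = 1e-4, warmup: int = 32_000) -> float:
    """Linear warmup to base_lr over `warmup` steps, then
    inverse-square-root decay (a common schedule for Transformer training)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup / step)
```

With these defaults the learning rate peaks at 1e-4 at step 32K and halves every 4x steps thereafter (e.g. 5e-5 at step 128K).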
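The inference settings mention a classifier-free guidance scale of 2.5 with a rescale factor of 0.75 [59]. A minimal sketch of guidance with rescaling, assuming the standard formulation (rescale the guided logits to match the standard deviation of the conditional branch, then blend with the raw guided logits); the function name and the use of plain lists are illustrative, not from the paper's code:

```python
import math

def cfg_with_rescale(cond, uncond, scale=2.5, rescale=0.75):
    """Classifier-free guidance with rescaling.

    cond / uncond: sequences of logits from the conditional and
    unconditional forward passes for one position.
    """
    # Standard classifier-free guidance: push away from the unconditional branch.
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]

    def std(xs):
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    # Rescale so the guided logits keep the conditional branch's spread,
    # then blend rescaled and raw guided logits with the rescale factor.
    s = std(cond) / max(std(guided), 1e-8)
    return [rescale * g * s + (1.0 - rescale) * g for g in guided]
```

With scale=2.5 and rescale=0.75 as in the quoted setup, the output spread stays close to that of the conditional logits, which helps avoid the over-saturated outputs that large guidance scales can produce.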