Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Authors: Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, Zhizheng Wu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Our experiments demonstrate that MaskGCT has achieved performance comparable to or superior to that of existing models in terms of speech quality, similarity, prosody, and intelligibility. |
| Researcher Affiliation | Collaboration | Yuancheng Wang1, Haoyue Zhan2, Liwei Liu1, Ruihong Zeng2, Haotian Guo1, Jiachen Zheng1, Qiang Zhang2, Shunsi Zhang2, Xueyao Zhang1, Zhizheng Wu1 1The Chinese University of Hong Kong, Shenzhen 2Guangzhou Quwan Network Technology EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and processes in detail, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | We release our code and model checkpoints at https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct. |
| Open Datasets | Yes | We use the Emilia [48] dataset to train our models. ... We evaluate our zero-shot TTS models with three benchmarks: (1) LibriSpeech [49] test-clean... (2) Seed-TTS test-en... (3) Seed-TTS test-zh... For accent imitation, we randomly sampled a portion of data from the L2-ARCTIC [60] accent corpus and the ESD [61] emotion corpus... |
| Dataset Splits | Yes | For the construction of the train and test datasets, we selected one male and one female speaker each from native English and native Mandarin backgrounds, resulting in a total of four speakers for the test dataset. The remaining 16 speakers were allocated to the training dataset. For the 350 parallel Chinese utterances, we randomly chose 22 utterances for the test set, with the remaining utterances designated for training. Similarly, for the 350 parallel English utterances, we randomly selected 21 utterances for the test set, with the rest used for training. |
| Hardware Specification | Yes | We train all models on 8 NVIDIA A100 80GB GPUs. ... Table 9: Real-time factor (RTF) comparison of MaskGCT and AR + SoundStorm on an A100 GPU for generating a 20-second speech. |
| Software Dependencies | No | The paper mentions several models, frameworks, and tools like "HuBERT-based ASR model", "Whisper-large-v3", "Paraformer-zh", "Llama-style Transformer", "phonemize", "jieba", and "pypinyin", but it does not provide specific version numbers for these software components or libraries, which is required for reproducibility. |
| Experiment Setup | Yes | We optimize these models with the AdamW [57] optimizer with a learning rate of 1e-4 and 32K warmup steps, following the inverse square root learning schedule. ... For the T2S model, we use 50 steps as the default total inference steps. The classifier-free guidance scale and the classifier-free guidance rescale factor [59] are set to 2.5 and 0.75, respectively. ... For sampling, we use a top-k of 20, with the sampling temperature annealing from 1.5 to 0. ... For the S2A model, we use [40, 16, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] steps for acoustic RVQ layers by default... |
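The optimizer settings quoted above (learning rate 1e-4, 32K warmup steps, inverse square root schedule) can be sketched as follows. This is a generic reconstruction of the standard warmup-then-inverse-square-root schedule, not code from the MaskGCT repository; the function name and the linear-warmup choice are assumptions.

```python
import math

def inv_sqrt_lr(step: int, base_lr: float = 1e-4, warmup: int = 32_000) -> float:
    """Linear warmup to base_lr over `warmup` steps, then
    inverse-square-root decay (a common schedule for Transformer training)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup / step)
```

With these defaults the learning rate peaks at 1e-4 at step 32K and halves every 4x steps thereafter (e.g. 5e-5 at step 128K).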
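The inference settings mention a classifier-free guidance scale of 2.5 with a rescale factor of 0.75 [59]. A minimal sketch of guidance with rescaling, assuming the standard formulation (rescale the guided logits to match the standard deviation of the conditional branch, then blend with the raw guided logits); the function name and the use of plain lists are illustrative, not from the paper's code:

```python
import math

def cfg_with_rescale(cond, uncond, scale=2.5, rescale=0.75):
    """Classifier-free guidance with rescaling.

    cond / uncond: sequences of logits from the conditional and
    unconditional forward passes for one position.
    """
    # Standard classifier-free guidance: push away from the unconditional branch.
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]

    def std(xs):
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    # Rescale so the guided logits keep the conditional branch's spread,
    # then blend rescaled and raw guided logits with the rescale factor.
    s = std(cond) / max(std(guided), 1e-8)
    return [rescale * g * s + (1.0 - rescale) * g for g in guided]
```

With scale=2.5 and rescale=0.75 as in the quoted setup, the output spread stays close to that of the conditional logits, which helps avoid the over-saturated outputs that large guidance scales can produce.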