Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing. |
| Researcher Affiliation | Collaboration | 1 MoE Key Lab of Artificial Intelligence, AI Institute X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China 2 Shenzhen Research Institute of Big Data, Shenzhen, China 3 AISpeech Ltd, Beijing, China |
| Pseudocode | Yes | Algorithm 1: Inference of CTX-txt2vec for speech editing. |
| Open Source Code | No | The paper mentions 'Audio samples are available at https://cpdu.github.io/unicats' which links to a demo page, not the source code for the methodology. Footnotes refer to third-party tools/models but no clear statement or link to the authors' own implementation code is provided. |
| Open Datasets | Yes | LibriTTS is a multi-speaker transcribed English speech dataset. Its training set consists of approximately 580 hours of speech data from 2,306 speakers. |
| Dataset Splits | No | The paper describes how test sets A, B, and C are derived from the LibriTTS dataset, but it does not specify explicit training/validation splits (e.g., percentages or sample counts) for reproducibility of the data partitioning. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers (AdamW, Adam) and refers to pretrained models (vq-wav2vec), but it does not provide specific version numbers for any software dependencies, libraries, or frameworks used. |
| Experiment Setup | Yes | In CTX-txt2vec, the text encoder consists of 6 layers of Transformer blocks. The VQ-diffusion decoder employs N = 12 Transformer-based blocks with attention layers comprising 8 heads and a dimension of 512. In Equation 7, the value of γ is set to 1. [...] CTX-txt2vec is trained for 50 epochs using an AdamW (Loshchilov and Hutter 2017) optimizer with a weight decay of 4.5 × 10⁻². The number of diffusion steps is set to T = 100. In CTX-vec2wav, both semantic encoders consist of M = 2 Conformer-based blocks. The attention layers within these blocks have 2 heads and a dimension of 184. The mel encoder employs a 1D convolution with a kernel size of 5 and an output channel of 184. CTX-vec2wav is trained using an Adam (Kingma and Ba 2014) optimizer for 800k steps. The initial learning rate is set to 2 × 10⁻⁴ and is halved every 200k steps. |
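
The CTX-vec2wav learning-rate schedule quoted in the Experiment Setup row (halved every 200k steps over 800k steps) can be illustrated with a minimal sketch. It assumes the initial rate is 2 × 10⁻⁴, that the decay is step-wise rather than continuous, and that the halving takes effect exactly at each 200k boundary; none of these details are confirmed by the excerpt, and the names `lr_at_step`, `INITIAL_LR`, `HALVING_INTERVAL`, and `TOTAL_STEPS` are hypothetical.

```python
# Illustrative sketch only: the step-wise form and the exact boundary
# behaviour are assumptions, not details confirmed by the paper.
INITIAL_LR = 2e-4            # assumed initial learning rate for CTX-vec2wav
HALVING_INTERVAL = 200_000   # steps between halvings, as quoted
TOTAL_STEPS = 800_000        # training length, as quoted


def lr_at_step(step: int) -> float:
    """Return the learning rate at a 0-indexed training step."""
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}), got {step}")
    # Integer division counts how many halvings have already happened.
    return INITIAL_LR * 0.5 ** (step // HALVING_INTERVAL)


if __name__ == "__main__":
    for step in (0, 199_999, 200_000, 400_000, 600_000, 799_999):
        print(f"step {step:>7}: lr = {lr_at_step(step):.2e}")
```

Under these assumptions the rate is halved three times during training, so the final 200k steps run at one eighth of the initial rate.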