UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that CTX-vec2wav outperforms HiFi-GAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing. |
| Researcher Affiliation | Collaboration | 1 MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; 2 Shenzhen Research Institute of Big Data, Shenzhen, China; 3 AISpeech Ltd, Beijing, China |
| Pseudocode | Yes | Algorithm 1: Inference of CTX-txt2vec for speech editing. (A hedged sketch of this inference loop appears after the table.) |
| Open Source Code | No | The paper mentions 'Audio samples are available at https://cpdu.github.io/unicats' which links to a demo page, not the source code for the methodology. Footnotes refer to third-party tools/models but no clear statement or link to the authors' own implementation code is provided. |
| Open Datasets | Yes | LibriTTS is a multi-speaker transcribed English speech dataset. Its training set consists of approximately 580 hours of speech data from 2,306 speakers. |
| Dataset Splits | No | The paper describes how test sets A, B, and C are derived from the LibriTTS dataset, but it does not specify explicit training/validation splits (e.g., percentages or sample counts) for reproducibility of the data partitioning. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers (AdamW, Adam) and refers to pretrained models (vq-wav2vec), but it does not provide specific version numbers for any software dependencies, libraries, or frameworks used. |
| Experiment Setup | Yes | In CTX-txt2vec, the text encoder consists of 6 layers of Transformer blocks. The VQ-diffusion decoder employs N = 12 Transformer-based blocks with attention layers comprising 8 heads and a dimension of 512. In Equation 7, the value of γ is set to 1. [...] CTX-txt2vec is trained for 50 epochs using an AdamW (Loshchilov and Hutter 2017) optimizer with a weight decay of 4.5 × 10⁻². The number of diffusion steps is set to T = 100. In CTX-vec2wav, both semantic encoders consist of M = 2 Conformer-based blocks. The attention layers within these blocks have 2 heads and a dimension of 184. The mel encoder employs a 1D convolution with a kernel size of 5 and an output channel of 184. CTX-vec2wav is trained using an Adam (Kingma and Ba 2014) optimizer for 800k steps. The initial learning rate is set to 2 × 10⁻⁴ and is halved every 200k steps. (These hyperparameters are collected in the configuration sketch following the table.) |
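The paper's Algorithm 1 is only named in the Pseudocode row above. As a rough illustration of how such a loop typically works, here is a minimal Python sketch of VQ-diffusion-style inference for speech editing: the semantic tokens of the surrounding speech are held fixed while the edited span is iteratively denoised from a fully masked state. The `denoise_logits` and `posterior_sample` method names are our assumptions, not the authors' API.

```python
import torch

@torch.no_grad()
def ctx_txt2vec_edit(model, text_ids, ctx_a, ctx_b, edit_len,
                     mask_id, T=100, device="cpu"):
    """Hedged sketch of VQ-diffusion inference for speech editing.

    ctx_a, ctx_b : LongTensors of semantic tokens extracted from the
                   speech before/after the edited region (kept fixed).
    edit_len     : number of semantic tokens to generate for the edit.
    mask_id      : index of the [MASK] token used by VQ-diffusion.
    """
    # The edited span starts fully masked, as in mask-and-replace
    # discrete diffusion; T = 100 matches the paper's step count.
    x = torch.full((1, edit_len), mask_id, dtype=torch.long, device=device)
    for t in reversed(range(T)):
        # Condition on the text and on the fixed context tokens on both sides.
        tokens = torch.cat([ctx_a, x, ctx_b], dim=1)
        # Hypothetical API: predict a distribution over clean tokens x0
        # for every position, given the noisy sequence and step t.
        logits = model.denoise_logits(tokens, text_ids, t)
        # Hypothetical API: sample x_{t-1} for the edited span only,
        # from the posterior q(x_{t-1} | x_t, x0_hat).
        span = slice(ctx_a.size(1), ctx_a.size(1) + edit_len)
        x = model.posterior_sample(logits[:, span], x, t)
    # Splice the generated span back between the untouched contexts.
    return torch.cat([ctx_a, x, ctx_b], dim=1)
```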
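For reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. The class and field names below are ours, not the authors', and the optimizer wiring only illustrates the reported "halve every 200k steps" learning-rate policy.

```python
from dataclasses import dataclass

import torch

@dataclass
class CTXTxt2VecConfig:
    text_encoder_layers: int = 6   # Transformer blocks in the text encoder
    decoder_blocks: int = 12       # N = 12 VQ-diffusion decoder blocks
    attn_heads: int = 8
    attn_dim: int = 512
    gamma: float = 1.0             # γ in the paper's Equation 7
    diffusion_steps: int = 100     # T = 100
    epochs: int = 50
    weight_decay: float = 4.5e-2   # AdamW (Loshchilov and Hutter 2017)

@dataclass
class CTXVec2WavConfig:
    semantic_encoder_blocks: int = 2  # M = 2 Conformer blocks per encoder
    attn_heads: int = 2
    attn_dim: int = 184
    mel_conv_kernel: int = 5          # 1D conv in the mel encoder
    mel_conv_channels: int = 184
    train_steps: int = 800_000
    init_lr: float = 2e-4             # Adam (Kingma and Ba 2014)
    lr_halve_every: int = 200_000

def make_vec2wav_optim(model: torch.nn.Module, cfg: CTXVec2WavConfig):
    # Adam with the reported initial learning rate, halved every
    # 200k steps via a step schedule; `model` is any torch.nn.Module.
    opt = torch.optim.Adam(model.parameters(), lr=cfg.init_lr)
    sched = torch.optim.lr_scheduler.StepLR(
        opt, step_size=cfg.lr_halve_every, gamma=0.5)
    return opt, sched
```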