Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Authors: Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. |
| Researcher Affiliation | Collaboration | Ziyue Jiang (Zhejiang University, ziyuejiang@zju.edu.cn); Zhe Su (Zhejiang University, suzhesz00@gmail.com); Zhou Zhao (Zhejiang University, zhaozhou@zju.edu.cn); Qian Yang (Zhejiang University, qyang1021@foxmail.com); Yi Ren (Bytedance AI Lab, ren.yi@bytedance.com); Jinglin Liu (Zhejiang University, jinglinliu@zju.edu.cn); Zhenhui Ye (Zhejiang University, zhenhuiye@zju.edu.cn) |
| Pseudocode | No | The paper describes its methodology in natural language and with diagrams, but it does not include a dedicated section or block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code is available at https://github.com/Zain-Jiang/Dict-TTS. |
| Open Datasets | Yes | We evaluate Dict-TTS on three datasets of different sizes, including: 1) Biaobei [3], a Chinese speech corpus... 2) JSUT [47], a Japanese speech corpus... 3) Common Voice (HK) [1], a Cantonese speech corpus... |
| Dataset Splits | Yes | For each of the three datasets, we randomly sample 400 samples for validation and 400 samples for testing. |
| Hardware Specification | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with batch size of 40 sentences on each GPU. |
| Software Dependencies | No | The paper mentions using tools such as the 'pre-trained XLM-R [10] model', 'HiFi-GAN [29]', 'pypinyin', 'pyopenjtalk', and 'pycantonese', but it does not specify version numbers for these tools or other key software dependencies needed for a reproducible setup in the main text. |
| Experiment Setup | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with a batch size of 40 sentences on each GPU. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and follow the same learning rate schedule as in [53]. The softmax temperature τ is initialized and annealed using the schedule in [23]. It takes 320k steps of training until convergence. (A hedged configuration sketch follows this table.) |
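
The optimizer and schedule details in the Experiment Setup row map onto a few lines of PyTorch. The sketch below is illustrative only: the stand-in model, model dimension, warmup length, and temperature-annealing constants are assumptions, not values from the paper; only the Adam hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹), the Transformer-style learning-rate schedule [53], the Gumbel-Softmax-style temperature annealing [23], and the 320k training steps come from the table above.

```python
# Minimal training-configuration sketch for the setup described above.
# Assumptions (not stated in the paper excerpt): d_model, warmup steps,
# and the temperature-annealing constants are illustrative placeholders.
import math
import torch

def noam_lr(step, d_model=256, warmup=4000):
    """Transformer learning-rate schedule from [53] (Vaswani et al.)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def softmax_temperature(step, tau_init=1.0, tau_min=0.1, decay=1e-5):
    """Exponential annealing in the style of the Gumbel-Softmax schedule [23];
    the constants here are placeholders, not the paper's values."""
    return max(tau_min, tau_init * math.exp(-decay * step))

model = torch.nn.Linear(256, 256)  # stand-in for the Dict-TTS model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1, 320_001):  # 320k steps until convergence
    tau = softmax_temperature(step)
    # ... forward pass using softmax temperature tau, loss.backward(), etc. ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

Setting the base learning rate to 1.0 lets `LambdaLR` apply the Noam schedule directly as a multiplicative factor; the training loop body is elided because the paper excerpt does not describe the loss computation.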