Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Authors: Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems.
Researcher Affiliation | Collaboration | Ziyue Jiang (Zhejiang University, ziyuejiang@zju.edu.cn); Zhe Su (Zhejiang University, suzhesz00@gmail.com); Zhou Zhao (Zhejiang University, zhaozhou@zju.edu.cn); Qian Yang (Zhejiang University, qyang1021@foxmail.com); Yi Ren (Bytedance AI Lab, ren.yi@bytedance.com); Jinglin Liu (Zhejiang University, jinglinliu@zju.edu.cn); Zhenhui Ye (Zhejiang University, zhenhuiye@zju.edu.cn)
Pseudocode | No | The paper describes its methodology in natural language and with diagrams, but it does not include a dedicated section or block explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code is available at https://github.com/Zain-Jiang/Dict-TTS.
Open Datasets | Yes | We evaluate Dict-TTS on three datasets of different sizes, including: 1) Biaobei [3], a Chinese speech corpus... 2) JSUT [47], a Japanese speech corpus... 3) Common Voice (HK) [1], a Cantonese speech corpus...
Dataset Splits | Yes | For each of the three datasets, we randomly sample 400 samples for validation and 400 samples for testing. (A split sketch follows below the table.)
Hardware Specification | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with batch size of 40 sentences on each GPU.
Software Dependencies | No | The paper mentions tools such as a pre-trained XLM-R [10] model, HiFi-GAN [29], pypinyin, pyopenjtalk, and pycantonese, but it does not give version numbers for these or other key software dependencies, so the environment cannot be reproduced from the main text alone. (A G2P frontend sketch follows below the table.)
Experiment Setup | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with batch size of 40 sentences on each GPU. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and follow the same learning rate schedule in [53]. The softmax temperature τ is initialized and annealed using the schedule in [23]. It takes 320k steps for training until convergence. (A training-setup sketch follows below the table.)
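The split procedure itself is not published in the main text; the following is a minimal sketch, assuming a simple seeded random hold-out of 400 validation and 400 test utterances per corpus. The helper name and seed are illustrative, not taken from the released code.

```python
import random

def holdout_split(utterances, n_valid=400, n_test=400, seed=1234):
    # Minimal sketch of the stated split: 400 validation + 400 test samples
    # drawn at random per corpus, with the remainder used for training.
    # The seed and function name are assumptions, not from the released code.
    items = list(utterances)
    random.Random(seed).shuffle(items)
    valid = items[:n_valid]
    test = items[n_valid:n_valid + n_test]
    train = items[n_valid + n_test:]
    return train, valid, test
```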
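The G2P frontends named in the Software Dependencies row (pypinyin, pyopenjtalk, pycantonese) are invoked in the paper only by name. The sketch below shows their typical public entry points; how Dict-TTS's baselines wrap them, and which library versions were used, is not specified, so treat the calls as an assumption about standard usage.

```python
from pypinyin import pinyin, Style   # Mandarin G2P
import pyopenjtalk                   # Japanese G2P
import pycantonese                   # Cantonese G2P

# Typical entry points of each library; pin versions when reproducing,
# since none are reported in the paper.
mandarin = pinyin("语音合成", style=Style.TONE3)            # nested list of numbered-tone pinyin
japanese = pyopenjtalk.g2p("音声合成")                      # space-separated phoneme string
cantonese = pycantonese.characters_to_jyutping("語音合成")  # list of (characters, jyutping) pairs
print(mandarin, japanese, cantonese)
```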
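Below is a minimal sketch of the quoted training configuration, assuming [53] refers to the Transformer paper (the "Noam" warmup schedule) and [23] to the Gumbel-Softmax paper (exponential temperature annealing). The model dimension, warmup steps, annealing rate, and placeholder model are illustrative values, not reported in the paper; only the Adam hyperparameters, batch size, and 320k-step budget are quoted.

```python
import math
import torch

def noam_lr(step, d_model=256, warmup=4000):
    # Transformer-style warmup/decay schedule, assuming [53] is Vaswani et al. (2017).
    # d_model and warmup are illustrative, not reported in the paper.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def softmax_temperature(step, tau_min=0.5, anneal_rate=1e-5):
    # Exponential temperature annealing in the style of Gumbel-Softmax training,
    # assuming [23] is Jang et al. (2017); the rate and floor are illustrative.
    return max(tau_min, math.exp(-anneal_rate * step))

model = torch.nn.Linear(256, 256)  # stand-in for the Dict-TTS acoustic model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,   # base lr; scaled per step by noam_lr
                             betas=(0.9, 0.98), eps=1e-9)  # values quoted from the paper
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1, 1001):          # the paper trains for ~320k steps until convergence
    # ... forward pass on a batch of 40 sentences, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    tau = softmax_temperature(step)  # annealed softmax temperature τ
```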