Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Authors: Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems.
Researcher Affiliation | Collaboration | Ziyue Jiang (Zhejiang University, ziyuejiang@zju.edu.cn); Zhe Su (Zhejiang University, suzhesz00@gmail.com); Zhou Zhao (Zhejiang University, zhaozhou@zju.edu.cn); Qian Yang (Zhejiang University, qyang1021@foxmail.com); Yi Ren (Bytedance AI Lab, ren.yi@bytedance.com); Jinglin Liu (Zhejiang University, jinglinliu@zju.edu.cn); Zhenhui Ye (Zhejiang University, zhenhuiye@zju.edu.cn)
Pseudocode | No | The paper describes its methodology in natural language and with diagrams, but it does not include a dedicated section or block explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code is available at https://github.com/Zain-Jiang/Dict-TTS.
Open Datasets | Yes | We evaluate Dict-TTS on three datasets of different sizes, including: 1) Biaobei [3], a Chinese speech corpus... 2) JSUT [47], a Japanese speech corpus... 3) Common Voice (HK) [1], a Cantonese speech corpus...
Dataset Splits | Yes | For each of the three datasets, we randomly sample 400 samples for validation and 400 samples for testing. (A split sketch follows below the table.)
Hardware Specification | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with batch size of 40 sentences on each GPU.
Software Dependencies | No | The paper mentions tools such as a pre-trained XLM-R [10] model, HiFi-GAN [29], pypinyin, pyopenjtalk, and pycantonese, but it does not give version numbers for these or other key software dependencies, so the environment cannot be reproduced from the main text alone. (A G2P frontend sketch follows below the table.)
Experiment Setup | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with batch size of 40 sentences on each GPU. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and follow the same learning rate schedule in [53]. The softmax temperature τ is initialized and annealed using the schedule in [23]. It takes 320k steps for training until convergence. (A training-setup sketch follows below the table.)
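The split procedure itself is not published in the main text; the following is a minimal sketch, assuming a simple seeded random hold-out of 400 validation and 400 test utterances per corpus. The helper name and seed are illustrative, not taken from the released code.

```python
import random

def holdout_split(utterances, n_valid=400, n_test=400, seed=1234):
    # Minimal sketch of the stated split: 400 validation + 400 test samples
    # drawn at random per corpus, with the remainder used for training.
    # The seed and function name are assumptions, not from the released code.
    items = list(utterances)
    random.Random(seed).shuffle(items)
    valid = items[:n_valid]
    test = items[n_valid:n_valid + n_test]
    train = items[n_valid + n_test:]
    return train, valid, test
```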
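The G2P frontends named in the Software Dependencies row (pypinyin, pyopenjtalk, pycantonese) are invoked in the paper only by name. The sketch below shows their typical public entry points; how Dict-TTS's baselines wrap them, and which library versions were used, is not specified, so treat the calls as an assumption about standard usage.

```python
from pypinyin import pinyin, Style   # Mandarin G2P
import pyopenjtalk                   # Japanese G2P
import pycantonese                   # Cantonese G2P

# Typical entry points of each library; pin versions when reproducing,
# since none are reported in the paper.
mandarin = pinyin("语音合成", style=Style.TONE3)            # nested list of numbered-tone pinyin
japanese = pyopenjtalk.g2p("音声合成")                      # space-separated phoneme string
cantonese = pycantonese.characters_to_jyutping("語音合成")  # list of (characters, jyutping) pairs
print(mandarin, japanese, cantonese)
```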
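Below is a minimal sketch of the quoted training configuration, assuming [53] refers to the Transformer paper (the "Noam" warmup schedule) and [23] to the Gumbel-Softmax paper (exponential temperature annealing). The model dimension, warmup steps, annealing rate, and placeholder model are illustrative values, not reported in the paper; only the Adam hyperparameters, batch size, and 320k-step budget are quoted.

```python
import math
import torch

def noam_lr(step, d_model=256, warmup=4000):
    # Transformer-style warmup/decay schedule, assuming [53] is Vaswani et al. (2017).
    # d_model and warmup are illustrative, not reported in the paper.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def softmax_temperature(step, tau_min=0.5, anneal_rate=1e-5):
    # Exponential temperature annealing in the style of Gumbel-Softmax training,
    # assuming [23] is Jang et al. (2017); the rate and floor are illustrative.
    return max(tau_min, math.exp(-anneal_rate * step))

model = torch.nn.Linear(256, 256)  # stand-in for the Dict-TTS acoustic model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,   # base lr; scaled per step by noam_lr
                             betas=(0.9, 0.98), eps=1e-9)  # values quoted from the paper
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1, 1001):          # the paper trains for ~320k steps until convergence
    # ... forward pass on a batch of 40 sentences, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    tau = softmax_temperature(step)  # annealed softmax temperature τ
```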