Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Authors: Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. |
| Researcher Affiliation | Collaboration | Ziyue Jiang (Zhejiang University), Zhe Su (Zhejiang University), Zhou Zhao (Zhejiang University), Qian Yang (Zhejiang University), Yi Ren (Bytedance AI Lab), Jinglin Liu (Zhejiang University), Zhenhui Ye (Zhejiang University) |
| Pseudocode | No | The paper describes its methodology in natural language and with diagrams, but it does not include a dedicated section or block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code is available at https://github.com/Zain-Jiang/Dict-TTS. |
| Open Datasets | Yes | We evaluate Dict-TTS on three datasets of different sizes, including: 1) Biaobei [3], a Chinese speech corpus... 2) JSUT [47], a Japanese speech corpus... 3) Common Voice (HK) [1], a Cantonese speech corpus... |
| Dataset Splits | Yes | For each of the three datasets, we randomly sample 400 samples for validation and 400 samples for testing. (A split sketch follows the table.) |
| Hardware Specification | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with batch size of 40 sentences on each GPU. |
| Software Dependencies | No | The paper mentions tools such as the 'pre-trained XLM-R [10] model', 'HiFi-GAN [29]', 'pypinyin', 'pyopenjtalk', and 'pycantonese', but the main text does not give version numbers or other software dependency information required for a reproducible setup. |
| Experiment Setup | Yes | We train Dict-TTS on 1 NVIDIA 3080Ti GPU, with batch size of 40 sentences on each GPU. We use the Adam optimizer with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹ and follow the same learning rate schedule as in [53]. The softmax temperature τ is initialized and annealed using the schedule in [23]. It takes 320k steps for training until convergence. (An optimizer sketch follows the table.) |
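For the Dataset Splits row, a minimal sketch of the random hold-out the paper describes (400 validation and 400 test samples per dataset); the helper name and the random seed are illustrative assumptions, since the paper does not state how the sampling was seeded:

```python
import random

def split_dataset(items, n_valid=400, n_test=400, seed=1234):
    """Randomly hold out validation and test sets, as the paper describes.

    The seed and this helper's name are assumptions, not from the paper.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    valid = items[:n_valid]
    test = items[n_valid:n_valid + n_test]
    train = items[n_valid + n_test:]
    return train, valid, test

# Example: split a toy corpus of 10,000 utterance IDs.
train, valid, test = split_dataset(range(10_000))
assert len(valid) == 400 and len(test) == 400
```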
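For the Experiment Setup row, a hedged PyTorch sketch of the quoted optimizer configuration. It assumes [53] refers to the inverse-square-root Transformer learning-rate schedule and [23] to an exponential softmax-temperature anneal; `d_model`, `warmup`, the annealing constants, and the placeholder model are all illustrative assumptions, not details confirmed by the paper:

```python
import math
import torch

model = torch.nn.Linear(256, 256)  # placeholder; stands in for Dict-TTS

# Adam hyperparameters as quoted from the paper. A base lr of 1.0 lets the
# LambdaLR scheduler below emit the absolute learning rate directly.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def transformer_lr(step, d_model=256, warmup=4000):
    # Inverse-square-root warmup schedule; d_model and warmup are
    # assumptions -- the paper only cites its source for the schedule.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)

def softmax_temperature(step, tau0=1.0, anneal_rate=3e-5, tau_min=0.5):
    # Exponential annealing of the softmax temperature τ; every constant
    # here is an assumption about the schedule the paper cites.
    return max(tau_min, tau0 * math.exp(-anneal_rate * step))

# Per the paper, training runs for 320k steps until convergence; each step
# would call optimizer.step() followed by scheduler.step().
```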