InstructSpeech: Following Speech Editing Instructions via Large Language Models
Authors: Rongjie Huang, Ruofan Hu, Yongqi Wang, Zehan Wang, Xize Cheng, Ziyue Jiang, Zhenhui Ye, Dongchao Yang, Luping Liu, Peng Gao, Zhou Zhao
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that InstructSpeech achieves state-of-the-art results in eleven tasks, for the first time unlocking the ability to edit the acoustic and semantic attributes of speech following a user's instruction. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2The Chinese University of Hong Kong 3Shanghai AI Lab. |
| Pseudocode | Yes | Algorithm 1 Multi-step reasoning for free-form editing. We use E_ASR, E_Dur, E_I respectively to denote the task embedding of automatic speech recognition, frame-level duration prediction, and task categories prediction tasks. |
| Open Source Code | No | The paper states 'Audio samples are available at https://InstructSpeech.github.io' but does not provide a concrete statement or link for the open-source code of their methodology. |
| Open Datasets | Yes | For speech processing and speech editing tasks, we use Librilight (Kahn et al., 2020), LibriSpeech (Panayotov et al., 2015), LibriTTS (Zen et al., 2019), and VCTK (Veaux et al., 2017) datasets. |
| Dataset Splits | No | The paper mentions using datasets for training and testing but does not provide specific details on train/validation/test dataset splits, such as percentages or sample counts for each split. |
| Hardware Specification | Yes | During training, we train InstructSpeech for 100K steps using 8 V100 GPUs with a batch size of 6000 tokens for each GPU on the publicly-available fairseq framework (Ott et al., 2019). |
| Software Dependencies | No | The paper mentions using the 'fairseq framework' but does not specify its version number or other software dependencies with their respective versions. |
| Experiment Setup | Yes | During training, we train InstructSpeech for 100K steps using 8 V100 GPUs with a batch size of 6000 tokens for each GPU on the publicly-available fairseq framework (Ott et al., 2019). Adam optimizer is used with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹. BigVGAN is optimized with a segment size of 8192 and a learning rate of 1 × 10⁻⁴ until 500K steps using 4 V100 GPUs. For sampling, we employ top-p (Holtzman et al., 2019) sampling with p = 0.25. |
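
The Pseudocode row above describes a multi-step reasoning chain for free-form editing. The sketch below is one plausible reading of that description, not the authors' implementation: the `generate` callable and the string-based conditioning are stand-ins, and only the task-embedding names (E_I, E_ASR, E_Dur) come from the paper.

```python
from typing import Callable

def free_form_edit(generate: Callable[[str, str], str],
                   speech: str, instruction: str) -> str:
    """Hypothetical multi-step chain in the spirit of Algorithm 1:
    predict the task category, transcribe, predict durations, then edit."""
    task = generate("E_I", instruction)     # 1. task-category prediction
    transcript = generate("E_ASR", speech)  # 2. automatic speech recognition
    durations = generate("E_Dur", speech)   # 3. frame-level duration prediction
    # 4. condition the final edit on the intermediate predictions
    return generate("edit", f"{instruction}|{task}|{transcript}|{durations}")

# Stub generator so the sketch runs end to end.
if __name__ == "__main__":
    stub = lambda task_emb, x: f"<{task_emb}:{x[:20]}>"
    print(free_form_edit(stub, "speech-token-sequence", "make the voice happier"))
```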
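
The Experiment Setup row reports the Adam hyperparameters and top-p sampling with p = 0.25. The minimal PyTorch sketch below reproduces just those two pieces of the configuration: the linear layer is a placeholder rather than the InstructSpeech model, and the top-p routine is a generic nucleus-sampling implementation, not the authors' fairseq code.

```python
import torch
import torch.nn as nn

# Adam configured with the reported hyperparameters (beta1 = 0.9,
# beta2 = 0.98, epsilon = 1e-9). The linear layer is a stand-in model.
model = nn.Linear(512, 1024)
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-9)

def top_p_sample(logits: torch.Tensor, p: float = 0.25) -> torch.Tensor:
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p, renormalize, then sample."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding cumulative mass already exceeds p;
    # the single most likely token is always kept.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

# Usage: sample one token id from a random logit vector.
token_id = top_p_sample(torch.randn(1024), p=0.25)
```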