InstructSpeech: Following Speech Editing Instructions via Large Language Models
Authors: Rongjie Huang, Ruofan Hu, Yongqi Wang, Zehan Wang, Xize Cheng, Ziyue Jiang, Zhenhui Ye, Dongchao Yang, Luping Liu, Peng Gao, Zhou Zhao
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that InstructSpeech achieves state-of-the-art results in eleven tasks, for the first time unlocking the ability to edit the acoustic and semantic attributes of speech following a user's instruction. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2The Chinese University of Hong Kong 3Shanghai AI Lab. |
| Pseudocode | Yes | Algorithm 1 Multi-step reasoning for free-form editing. We use E_ASR, E_Dur, E_I respectively to denote the task embedding of automatic speech recognition, frame-level duration prediction, and task categories prediction tasks. |
| Open Source Code | No | The paper states 'Audio samples are available at https://InstructSpeech.github.io' but does not provide a concrete statement or link for the open-source code of their methodology. |
| Open Datasets | Yes | For speech processing and speech editing tasks, we use Librilight (Kahn et al., 2020), LibriSpeech (Panayotov et al., 2015), LibriTTS (Zen et al., 2019), and VCTK (Veaux et al., 2017) datasets. |
| Dataset Splits | No | The paper mentions using datasets for training and testing but does not provide specific details on train/validation/test dataset splits, such as percentages or sample counts for each split. |
| Hardware Specification | Yes | During training, we train InstructSpeech for 100K steps using 8 V100 GPUs with a batch size of 6000 tokens for each GPU on the publicly-available fairseq framework (Ott et al., 2019). |
| Software Dependencies | No | The paper mentions using the 'fairseq framework' but does not specify its version number or other software dependencies with their respective versions. |
| Experiment Setup | Yes | During training, we train InstructSpeech for 100K steps using 8 V100 GPUs with a batch size of 6000 tokens for each GPU on the publicly-available fairseq framework (Ott et al., 2019). Adam optimizer is used with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹. BigVGAN is optimized with a segment size of 8192 and a learning rate of 1 × 10⁻⁴ until 500K steps using 4 V100 GPUs. For sampling, we employ top-p (Holtzman et al., 2019) sampling with p = 0.25. |
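
The Pseudocode row above describes a multi-step reasoning chain for free-form editing. The sketch below is one plausible reading of that description, not the authors' implementation: the `generate` callable and the string-based conditioning are stand-ins, and only the task-embedding names (E_I, E_ASR, E_Dur) come from the paper.

```python
from typing import Callable

def free_form_edit(generate: Callable[[str, str], str],
                   speech: str, instruction: str) -> str:
    """Hypothetical multi-step chain in the spirit of Algorithm 1:
    predict the task category, transcribe, predict durations, then edit."""
    task = generate("E_I", instruction)     # 1. task-category prediction
    transcript = generate("E_ASR", speech)  # 2. automatic speech recognition
    durations = generate("E_Dur", speech)   # 3. frame-level duration prediction
    # 4. condition the final edit on the intermediate predictions
    return generate("edit", f"{instruction}|{task}|{transcript}|{durations}")

# Stub generator so the sketch runs end to end.
if __name__ == "__main__":
    stub = lambda task_emb, x: f"<{task_emb}:{x[:20]}>"
    print(free_form_edit(stub, "speech-token-sequence", "make the voice happier"))
```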
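
The Experiment Setup row reports the Adam hyperparameters and top-p sampling with p = 0.25. The minimal PyTorch sketch below reproduces just those two pieces of the configuration: the linear layer is a placeholder rather than the InstructSpeech model, and the top-p routine is a generic nucleus-sampling implementation, not the authors' fairseq code.

```python
import torch
import torch.nn as nn

# Adam configured with the reported hyperparameters (beta1 = 0.9,
# beta2 = 0.98, epsilon = 1e-9). The linear layer is a stand-in model.
model = nn.Linear(512, 1024)
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-9)

def top_p_sample(logits: torch.Tensor, p: float = 0.25) -> torch.Tensor:
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p, renormalize, then sample."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding cumulative mass already exceeds p;
    # the single most likely token is always kept.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

# Usage: sample one token id from a random logit vector.
token_id = top_p_sample(torch.randn(1024), p=0.25)
```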