SpeechAlign: Aligning Speech Generation to Human Preferences

Authors: Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models.
Researcher Affiliation | Academia | Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu (Fudan University). dongzhang22@m.fudan.edu.cn, lizhaowei126@gmail.com, {zhouyaqian,xpqiu}@fudan.edu.cn
Pseudocode | Yes | Algorithm 1: SpeechAlign
Open Source Code | No | Answer: [No] Justification: [TODO] Guidelines: We'll open-source the code once accepted.
Open Datasets | Yes | For the continue finetuning stage in Section 2.1, we use the LibriSpeech dataset. To construct the preference codec dataset, we randomly sample 50k speech-text pairs from the LibriSpeech training set.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages, absolute sample counts, or references to specific predefined splits for the main model training. It mentions using the 'LibriSpeech dataset' and the 'LibriSpeech training set' for data collection, but it does not define a clear validation split.
Hardware Specification | Yes | train for 3500 steps on 8 A100 80G GPUs.
Software Dependencies | No | The paper mentions models such as the 'Whisper medium-en' model, 'SpeechGPT', 'SoundStorm', and 'wavlm-base-plus-sv', but it does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For the continue finetuning stage in Section 2.1, the batch size is set to 256 with a learning rate of 1e-5, and training runs for 3500 steps on 8 A100 80G GPUs. For CoH finetuning, the batch size is set to 32 with a learning rate of 1e-5, and training runs for 12000 steps on 8 A100 80G GPUs.
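
The Experiment Setup and Open Datasets rows pin down two training stages: continued finetuning on LibriSpeech, then CoH finetuning on a preference codec dataset built from 50k sampled speech-text pairs. Below is a minimal sketch of that schedule, assuming a simple two-stage pipeline; the hyperparameters are the paper's reported values, while StageConfig, speechalign_pipeline, and the injected finetune / build_preference_codec_dataset / coh_finetune callables are hypothetical placeholders for the authors' unreleased code.

```python
# Hedged sketch of the two reported training stages; only the hyperparameters
# are taken from the paper, everything else is an assumed scaffold.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StageConfig:
    batch_size: int
    learning_rate: float
    train_steps: int
    hardware: str = "8x A100 80G"  # reported for both stages


# Stage 1: continued finetuning on LibriSpeech (Section 2.1 of the paper).
CONTINUE_FT = StageConfig(batch_size=256, learning_rate=1e-5, train_steps=3500)

# Stage 2: CoH finetuning on the preference codec dataset built from
# 50k speech-text pairs sampled from the LibriSpeech training set.
COH_FT = StageConfig(batch_size=32, learning_rate=1e-5, train_steps=12000)


def speechalign_pipeline(
    model: Any,
    librispeech_train: Any,
    finetune: Callable[[Any, Any, StageConfig], Any],
    build_preference_codec_dataset: Callable[[Any, Any, int], Any],
    coh_finetune: Callable[[Any, Any, StageConfig], Any],
) -> Any:
    """Hypothetical driver mirroring the reported setup."""
    # Stage 1: continued finetuning on the LibriSpeech dataset.
    model = finetune(model, librispeech_train, CONTINUE_FT)
    # Construct the preference codec dataset from 50k sampled speech-text pairs.
    pref_data = build_preference_codec_dataset(model, librispeech_train, 50_000)
    # Stage 2: CoH (Chain-of-Hindsight) finetuning on the preference data.
    return coh_finetune(model, pref_data, COH_FT)
```

The callables are dependency-injected so the sketch stays runnable as a module without committing to any particular training framework; swapping in real training and data-construction routines would reproduce the schedule described in the table.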