SpeechAlign: Aligning Speech Generation to Human Preferences
Authors: Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. |
| Researcher Affiliation | Academia | Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu; Fudan University; dongzhang22@m.fudan.edu.cn, lizhaowei126@gmail.com, {zhouyaqian,xpqiu}@fudan.edu.cn |
| Pseudocode | Yes | Algorithm 1: SpeechAlign |
| Open Source Code | No | Answer: [No]. Justification: "We'll open-source the code once accepted." |
| Open Datasets | Yes | For the continued finetuning stage in Section 2.1, we use the LibriSpeech dataset. To construct the preference codec dataset, we randomly sample 50k speech-text pairs from the LibriSpeech training set. (A hedged sampling sketch follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test splits (percentages, absolute sample counts, or references to predefined splits) for the main model training. It mentions the LibriSpeech dataset and the LibriSpeech training set for data collection, but no validation split is defined. |
| Hardware Specification | Yes | Trained for 3,500 steps on 8 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions models such as the 'Whisper medium-en' model, 'SpeechGPT', 'SoundStorm', and 'wavlm-base-plus-sv', but does not provide version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For the continued finetuning stage in Section 2.1, the batch size is set to 256 with a learning rate of 1e-5, training for 3,500 steps on 8 A100 80GB GPUs. For CoH finetuning, the batch size is set to 32 with a learning rate of 1e-5, training for 12,000 steps on 8 A100 80GB GPUs. (See the configuration sketch below the table.) |
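
The Open Datasets row reports that the preference codec dataset is built by randomly sampling 50k speech-text pairs from the LibriSpeech training set. The sketch below illustrates that step only; the Hugging Face dataset identifier, split name, and random seed are assumptions, since the paper does not say how the data was loaded.

```python
# Minimal sketch of the preference-data collection step: randomly sample
# 50k speech-text pairs from the LibriSpeech training set.
# Dataset id, split, and seed are assumptions, not from the paper.
from datasets import load_dataset

NUM_PAIRS = 50_000  # size of the preference codec dataset reported in the paper
SEED = 0            # assumed seed; not specified in the paper

# "train.360" is one of the standard LibriSpeech clean training splits;
# the paper only says "LibriSpeech training set".
librispeech = load_dataset("librispeech_asr", "clean", split="train.360")

subset = librispeech.shuffle(seed=SEED).select(range(NUM_PAIRS))

# Each example keeps the raw audio and its transcript, i.e. a speech-text pair.
for example in subset.select(range(2)):
    print(example["text"][:60], example["audio"]["sampling_rate"])
```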
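
The Experiment Setup row lists the hyperparameters for the two finetuning stages. The configuration sketch below just collects those numbers in one place; the dataclass and field names are illustrative assumptions rather than the authors' code, and only the batch sizes, learning rates, step counts, and GPU count come from the paper.

```python
# Illustrative summary of the two finetuning configurations reported in the
# Experiment Setup row. Class and field names are assumptions; the numeric
# values (batch size, learning rate, steps, GPU count) are from the paper.
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    stage: str
    batch_size: int       # global batch size
    learning_rate: float
    max_steps: int
    num_gpus: int = 8     # 8 x A100 80GB in both stages

# Continued finetuning on LibriSpeech (Section 2.1 of the paper).
continue_ft = FinetuneConfig(
    stage="continue_finetune", batch_size=256, learning_rate=1e-5, max_steps=3500
)

# Chain-of-Hindsight (CoH) finetuning on the preference codec dataset.
coh_ft = FinetuneConfig(
    stage="coh_finetune", batch_size=32, learning_rate=1e-5, max_steps=12000
)

for cfg in (continue_ft, coh_ft):
    print(cfg)
```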