SpeechAlign: Aligning Speech Generation to Human Preferences
Authors: Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. |
| Researcher Affiliation | Academia | Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu; Fudan University; dongzhang22@m.fudan.edu.cn, lizhaowei126@gmail.com, {zhouyaqian,xpqiu}@fudan.edu.cn |
| Pseudocode | Yes | Algorithm 1: SpeechAlign |
| Open Source Code | No | Answer: [No]. Justification: "We'll open-source the code once accepted." |
| Open Datasets | Yes | For the continued finetuning stage in Section 2.1, we use the LibriSpeech dataset. To construct the preference codec dataset, we randomly sample 50k speech-text pairs from the LibriSpeech training set. (A hedged sampling sketch follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test splits (percentages, absolute sample counts, or references to predefined splits) for the main model training. It mentions the LibriSpeech dataset and the LibriSpeech training set for data collection, but no validation split is defined. |
| Hardware Specification | Yes | Trained for 3,500 steps on 8 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions models such as the 'Whisper medium-en' model, 'SpeechGPT', 'SoundStorm', and 'wavlm-base-plus-sv', but does not provide version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For the continued finetuning stage in Section 2.1, the batch size is set to 256 with a learning rate of 1e-5, training for 3,500 steps on 8 A100 80GB GPUs. For CoH finetuning, the batch size is set to 32 with a learning rate of 1e-5, training for 12,000 steps on 8 A100 80GB GPUs. (See the configuration sketch below the table.) |
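
The Open Datasets row reports that the preference codec dataset is built by randomly sampling 50k speech-text pairs from the LibriSpeech training set. The sketch below illustrates that step only; the Hugging Face dataset identifier, split name, and random seed are assumptions, since the paper does not say how the data was loaded.

```python
# Minimal sketch of the preference-data collection step: randomly sample
# 50k speech-text pairs from the LibriSpeech training set.
# Dataset id, split, and seed are assumptions, not from the paper.
from datasets import load_dataset

NUM_PAIRS = 50_000  # size of the preference codec dataset reported in the paper
SEED = 0            # assumed seed; not specified in the paper

# "train.360" is one of the standard LibriSpeech clean training splits;
# the paper only says "LibriSpeech training set".
librispeech = load_dataset("librispeech_asr", "clean", split="train.360")

subset = librispeech.shuffle(seed=SEED).select(range(NUM_PAIRS))

# Each example keeps the raw audio and its transcript, i.e. a speech-text pair.
for example in subset.select(range(2)):
    print(example["text"][:60], example["audio"]["sampling_rate"])
```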
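
The Experiment Setup row lists the hyperparameters for the two finetuning stages. The configuration sketch below just collects those numbers in one place; the dataclass and field names are illustrative assumptions rather than the authors' code, and only the batch sizes, learning rates, step counts, and GPU count come from the paper.

```python
# Illustrative summary of the two finetuning configurations reported in the
# Experiment Setup row. Class and field names are assumptions; the numeric
# values (batch size, learning rate, steps, GPU count) are from the paper.
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    stage: str
    batch_size: int       # global batch size
    learning_rate: float
    max_steps: int
    num_gpus: int = 8     # 8 x A100 80GB in both stages

# Continued finetuning on LibriSpeech (Section 2.1 of the paper).
continue_ft = FinetuneConfig(
    stage="continue_finetune", batch_size=256, learning_rate=1e-5, max_steps=3500
)

# Chain-of-Hindsight (CoH) finetuning on the preference codec dataset.
coh_ft = FinetuneConfig(
    stage="coh_finetune", batch_size=32, learning_rate=1e-5, max_steps=12000
)

for cfg in (continue_ft, coh_ft):
    print(cfg)
```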