BATON: Aligning Text-to-Audio Model Using Human Preference Feedback

Authors: Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Qinmei Xu, Zunnan Xu, Jingquan Liu, Jiasheng Lu, Xiu Li

IJCAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "The experiment results demonstrate that our BATON can significantly improve the generation quality of the original text-to-audio models, concerning audio integrity, temporal relationship, and alignment with human preference." |
| Researcher Affiliation | Collaboration | Huan Liao1, Haonan Han1, Kai Yang1, Tianjiao Du1, Rui Yang1, Qinmei Xu1, Zunnan Xu1, Jingquan Liu1, Jiasheng Lu2, and Xiu Li1 (1 Tsinghua University; 2 Huawei Technologies Co., Ltd.) |
| Pseudocode | No | The paper describes the framework and methods but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | No | A project page is available at https://baton2024.github.io; at the time of checking, it states "Code will be released soon", so the code is not yet available. |
| Open Datasets | Yes | "We specifically select audio event categories that ranked among the top 200 in occurrence within AudioCaps [Kim et al., 2019]" |
| Dataset Splits | No | The paper reports the sizes of the datasets used (D_data, D_human, D_pretrain), describes how D_human was created and annotated, and later defines a specific test set. However, it does not give explicit train/validation/test splits (counts or percentages) for either the main model's fine-tuning or the reward model's training. |
| Hardware Specification | No | The paper states that it "utilized a total of 48GB*2 GPU memory," but it does not specify the exact GPU model, CPU, or other detailed hardware specifications. |
| Software Dependencies | No | The paper cites optimizers such as Adam and AdamW, but does not specify software dependencies such as programming-language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x) required for reproducibility. |
| Experiment Setup | Yes | "We trained the audio reward model on the synthetic dataset over 50 epochs, with a batch size of 64 and a learning rate of 0.01 using Adam [Kingma and Ba, 2014]. During the fine-tuning of the original model, we assigned a weight parameter β of 0.5 to the pretrain loss. The fine-tuning process was conducted over 10 epochs with a learning rate of 1 × 10^-5, a batch size of 6, and the default AdamW optimizer [Loshchilov and Hutter, 2017]." |
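The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustrative reconstruction, not the authors' released code: the names `reward_model_cfg`, `finetune_cfg`, and `combined_loss` are hypothetical, and the additive form of `combined_loss` (fine-tuning loss plus β times the pretrain loss) is an assumption inferred from the stated weight β = 0.5.

```python
# Hedged sketch of the reported training configuration.
# All numeric values are quoted from the paper's experiment setup;
# the dict/function names are illustrative, not from the paper.

reward_model_cfg = {
    "epochs": 50,        # reward model trained over 50 epochs
    "batch_size": 64,
    "lr": 0.01,
    "optimizer": "Adam",  # Adam [Kingma and Ba, 2014]
}

finetune_cfg = {
    "epochs": 10,         # fine-tuning of the original text-to-audio model
    "batch_size": 6,
    "lr": 1e-5,
    "optimizer": "AdamW",  # AdamW [Loshchilov and Hutter, 2017]
    "beta": 0.5,           # weight on the pretrain loss
}

def combined_loss(finetune_loss: float, pretrain_loss: float, beta: float = 0.5) -> float:
    """Assumed total objective: fine-tuning loss plus a beta-weighted pretrain loss."""
    return finetune_loss + beta * pretrain_loss
```

With β = 0.5, a fine-tuning loss of 1.0 and a pretrain loss of 2.0 would combine to 2.0 under this assumed formulation.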