UniAudio: Towards Universal Audio Generation with Large Language Models

Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Zhou Zhao, Xixin Wu, Helen M. Meng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, UniAudio supports 11 audio generation tasks in total. The building process of UniAudio is scaled up to 100k hours of audio and 1B parameters. Among the 11 tasks, UniAudio consistently obtains competitive performance in both objective and subjective evaluations. We further conduct a comprehensive ablation study to verify that building this unified audio generation model by joint training is mutually beneficial to each task involved. We also show that UniAudio can support new tasks seamlessly via simple fine-tuning.
Researcher Affiliation | Collaboration | 1 The Chinese University of Hong Kong, Hong Kong SAR, China; 2 Language Technologies Institute, Carnegie Mellon University, USA; 3 Microsoft Research Asia, China; 4 Zhejiang University, China; 5 Independent Researcher, China.
Pseudocode | No | No explicit pseudocode or algorithm block labeled 'Pseudocode' or 'Algorithm' was found.
Open Source Code | Yes | Demo and code are released at http://dongchaoyang.top/UniAudio_demo/
Open Datasets | Yes | UniAudio is built on 12 datasets, all of which are publicly available; the 12 public datasets are adopted in this work for training. Besides, several test sets are additionally used only for zero-shot evaluation. The statistics of these datasets are in Table 8.
Dataset Splits | No | The paper specifies training and test sets but does not explicitly provide details about a validation set split for the main UniAudio model. Appendix A.1, Table 8 lists 'Train Volume (hrs)' and 'Test set' for each task.
Hardware Specification | Yes | Both the training and fine-tuning are completed with 16 AMD MI200-64G GPUs.
Software Dependencies | No | The paper mentions several models and tools, such as a 'pre-trained text T5 model', 'Transformer', and 'ESPNet tools', but does not specify software library versions (e.g., PyTorch, TensorFlow, or Python package versions) required for reproducibility.
Experiment Setup | Yes | Detailed model configuration is in Appendix A.2: Table 10 gives the model configuration (with N = 3) and Table 11 gives the optimization configuration using the AdamW optimizer. 'We set the dropout rate as 0. The learning rate is 6e-4. For each batch, we set the sequence length as 6000. The activation function is GELU. The model is trained with AdamW optimizer.'
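
To make the quoted optimization setup concrete, here is a minimal sketch in PyTorch. Only the AdamW optimizer, the 6e-4 learning rate, zero dropout, the GELU activation, and the 6000-token batch sequence length come from the paper; the Transformer width, depth, and head count below are illustrative assumptions, not the actual UniAudio configuration from Table 10.

```python
# Minimal sketch of the quoted optimization setup, assuming a standard
# PyTorch training loop. Model dimensions are placeholders, not the
# UniAudio configuration reported in Table 10.
import torch
from torch import nn

# Placeholder Transformer stack; dropout 0 and GELU follow the quoted setup,
# while d_model / nhead / depth are illustrative assumptions only.
layer = nn.TransformerEncoderLayer(
    d_model=1024, nhead=16, dim_feedforward=4096,
    dropout=0.0, activation="gelu", batch_first=True,
)
model = nn.TransformerEncoder(layer, num_layers=12)

# "The model is trained with AdamW optimizer. The learning rate is 6e-4."
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

# "For each batch, we set the sequence length as 6000" -- interpreted here
# as a cap on tokens per batch when packing training sequences.
MAX_TOKENS_PER_BATCH = 6000
```

Distributed training across the 16 AMD MI200-64G GPUs and the learning-rate schedule from Table 11 are omitted from this sketch.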