UniAudio: Towards Universal Audio Generation with Large Language Models
Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Zhou Zhao, Xixin Wu, Helen M. Meng
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, UniAudio supports 11 audio generation tasks in total and achieves competitive results on all tasks consistently; it can also support new tasks seamlessly via simple fine-tuning. The building process of UniAudio is scaled up to 100k hours of audio and 1B parameters. Among the 11 tasks, UniAudio consistently obtains competitive performance in both objective and subjective evaluations. We further conduct a comprehensive ablation study to verify that building this unified audio generation model by joint training is mutually beneficial to each task involved. |
| Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong, Hong Kong SAR, China 2Language Technologies Institute, Carnegie Mellon University, USA 3Microsoft Research Asia, China 4Zhejiang University, China 5Independent Researcher, China |
| Pseudocode | No | No explicit pseudocode or algorithm block labeled 'Pseudocode' or 'Algorithm' was found. |
| Open Source Code | Yes | Demo and code are released at http://dongchaoyang.top/UniAudio_demo/ |
| Open Datasets | Yes | UniAudio is built on 12 datasets, all of which are publicly available and adopted in this work for training. Besides, several test sets are additionally used only for zero-shot evaluation. The statistics of these datasets are in Table 8. |
| Dataset Splits | No | The paper specifies training and test sets but does not explicitly describe a validation split for the main UniAudio model. Appendix A.1, Table 8 lists 'Train Volume (hrs)' and 'Test set' for each task. |
| Hardware Specification | Yes | Both the training and fine-tuning are completed with 16 AMD MI200-64G GPUs. |
| Software Dependencies | No | The paper mentions several models and tools, such as a 'pre-trained text T5 model', 'Transformer', and 'ESPNet tools', but does not specify software library versions (e.g., PyTorch, TensorFlow, or Python package versions) required for reproducibility. |
| Experiment Setup | Yes | Detailed model configuration is in Appendix A.2: Table 10 gives the model configuration (with N = 3) and Table 11 the optimization configuration using the AdamW optimizer. We set the dropout rate as 0. The learning rate is 6e-4. For each batch, we set the sequence length as 6000. The activation function is GELU. The model is trained with the AdamW optimizer. |
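
Below is a minimal sketch of the reported optimization configuration, assuming a PyTorch setup; the model width, depth, and head count are illustrative placeholders (the paper's exact values are in its Table 10), while dropout, activation, optimizer, and learning rate follow the quotes above.

```python
import torch
import torch.nn as nn

# Hypothetical Transformer stack standing in for the UniAudio backbone.
# Only dropout=0, GELU activation, AdamW, and lr=6e-4 are taken from the paper;
# the remaining hyperparameters are assumed for illustration.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=1536,          # assumed width
        nhead=16,              # assumed number of attention heads
        dim_feedforward=6144,  # assumed feed-forward width
        dropout=0.0,           # "We set the dropout rate as 0."
        activation="gelu",     # "The activation function is GELU."
        batch_first=True,
    ),
    num_layers=24,             # assumed depth
)

# "The model is trained with the AdamW optimizer. The learning rate is 6e-4."
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
```

Per the paper, each batch is packed to a sequence length of 6000 tokens; how sequences are bucketed and padded to reach that length is not specified in the quoted setup.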