Audio Generation with Multiple Conditional Diffusion Model
Authors: Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. |
| Researcher Affiliation | Collaboration | ¹Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; ²University of Chinese Academy of Sciences, Beijing, China; ³Toshiba China R&D Center, Beijing, China |
| Pseudocode | No | The paper describes the model's architecture and processes in prose and figures, but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Audio samples and our dataset are publicly available (https://conditionaudiogen.github.io/conditionaudiogen/). This statement explicitly refers to "Audio samples and our dataset" being available, not the source code for the model or methodology. |
| Open Datasets | Yes | We integrate the existing datasets to create a new dataset for this task, which contains audio, corresponding text, and control conditions. ... We randomly split Audio Condition into three sets: 89557 samples for training, 1398 samples for validation, and 1110 samples for testing, which are publicly available. |
| Dataset Splits | Yes | We randomly split Audio Condition into three sets: 89557 samples for training, 1398 samples for validation, and 1110 samples for testing, which are publicly available. (A sketch of this split appears after the table.) |
| Hardware Specification | No | The paper describes its model architecture and components, including the use of a pre-trained TTA model (Tango) and specific LLMs (FLAN-T5-LARGE), but does not specify the hardware (e.g., GPU models, CPU types, memory) used for training or running experiments. |
| Software Dependencies | No | The paper mentions specific pre-trained models and vocoders (e.g., FLAN-T5-LARGE, Tango, HiFi-GAN) and references signal processing tools, but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In this section, we present the experiment setup, including the model configuration and baseline models. ... Similarly, on the Audio Condition test set in Table 4, we observed that these two parameters also play a significant role in control: (1) Classifier-free guidance scales: in the absence of classifier-free guidance, whose scale is set to 1, performance is poor only when the control condition is the timestamp. Increasing the scale to 5 improves performance across most evaluation metrics. However, further increases in the scale lead to a decline in performance, which may be because the diversity brought by larger scales hinders the controllability of the model. (2) Inference steps: temporal order control reaches its optimum at step 100 and energy control at step 200, but pitch control has no single step at which all indicators reach their best values, since pitch may be more difficult to model. (A sketch of the guidance combination appears after the table.) |
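
The split in the Dataset Splits row is a plain random partition into the reported sizes. Below is a minimal sketch of reproducing such a split, assuming a simple shuffle-and-slice scheme; the seed, ordering, and function name are assumptions, since the paper releases the split itself but not a split script.

```python
import random

def split_audio_condition(samples, n_train=89557, n_val=1398, n_test=1110, seed=0):
    """Randomly partition Audio Condition samples into the reported
    train/val/test sizes (89557/1398/1110). The seed and shuffle order
    are assumptions; the paper does not specify them."""
    assert len(samples) == n_train + n_val + n_test
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```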
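
The classifier-free guidance behavior quoted in the Experiment Setup row matches the standard formulation, in which a scale of 1 reduces to the plain conditional estimate (i.e., no guidance). Below is a minimal sketch of that combination step, assuming the usual epsilon-prediction parameterization; the tensor and function names are illustrative, not the authors' code.

```python
import torch

def cfg_noise_estimate(eps_cond: torch.Tensor,
                       eps_uncond: torch.Tensor,
                       scale: float = 5.0) -> torch.Tensor:
    """Standard classifier-free guidance combination for diffusion
    samplers: blend the conditional and unconditional noise estimates.
    At scale=1 this returns eps_cond unchanged, matching the paper's
    'absence of classifier-free guidance' baseline."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

At `scale=1` the expression collapses to `eps_cond`, which is why that setting serves as the no-guidance baseline; the paper reports that a scale of 5 performs best on most metrics, while larger scales degrade controllability.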