Audio Generation with Multiple Conditional Diffusion Model

Authors: Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation.
Researcher Affiliation Collaboration (1) Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; (2) University of Chinese Academy of Sciences, Beijing, China; (3) Toshiba China R&D Center, Beijing, China
Pseudocode No The paper describes the model's architecture and processes in prose and figures, but does not include explicit pseudocode or algorithm blocks.
Open Source Code No Audio samples and our dataset are publicly available (https://conditionaudiogen.github.io/conditionaudiogen/) - This statement explicitly refers to "Audio samples and our dataset" being available, not the source code for the model or methodology.
Open Datasets Yes We integrate the existing datasets to create a new dataset for this task, which contains audio, corresponding text, and control conditions. ... We randomly split Audio Condition into three sets: 89557 samples for training, 1398 samples for validation, and 1110 samples for testing, which are publicly available.
Dataset Splits Yes We randomly split Audio Condition into three sets: 89557 samples for training, 1398 samples for validation, and 1110 samples for testing, which are publicly available.
Hardware Specification No The paper describes its model architecture and components, including the use of a pre-trained TTA model (Tango) and specific LLMs (FLAN-T5-LARGE), but does not specify the hardware (e.g., GPU models, CPU types, memory) used for training or running experiments.
Software Dependencies No The paper mentions specific pre-trained models and vocoders (e.g., FLAN-T5-LARGE, Tango, HiFi-GAN) and references signal processing tools, but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes In this section, we present the experiment setup, including the model configuration and baseline models. ... Similarly, on the Audio Condition test set in Table 4, we have observed that these two parameters also play a significant role in control: (1) Classifier-free guidance scales: without classifier-free guidance (scale set to 1), performance is poor only when the control condition is the timestamp. Increasing the scale to 5 improves performance across most evaluation metrics. However, further increasing the scale degrades performance, which may be because the diversity introduced by larger scales hinders the controllability of the model. (2) Inference steps: temporal order control reaches its optimum at step 100 and energy control at step 200, but pitch control has no single step at which all indicators reach their best values, since pitch may be more difficult to model.
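The classifier-free guidance scale discussed in the Experiment Setup row blends a conditional and an unconditional noise prediction at each sampling step. A minimal sketch of that blend, assuming a generic denoiser interface (the function names and the toy model below are illustrative, not the paper's implementation):

```python
def cfg_noise_estimate(model, x_t, t, cond, scale=5.0):
    """Classifier-free guidance: extrapolate from the unconditional
    toward the conditional noise prediction. scale=1 reduces to the
    plain conditional prediction (guidance effectively disabled)."""
    eps_cond = model(x_t, t, cond)    # prediction with the control condition
    eps_uncond = model(x_t, t, None)  # prediction with a null condition
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Hypothetical toy denoiser: conditioning shifts the output by a constant.
def toy_model(x_t, t, cond):
    return x_t + (1.0 if cond is not None else 0.0)

print(cfg_noise_estimate(toy_model, 2.0, 50, "timestamp", scale=1.0))  # 3.0
print(cfg_noise_estimate(toy_model, 2.0, 50, "timestamp", scale=5.0))  # 7.0
```

With this toy model, raising the scale from 1 to 5 pushes the estimate further in the direction the condition suggests, which is the mechanism behind the paper's observation that a moderate scale sharpens control while an excessive one distorts the output.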
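The dataset split reported in the table (89557 train / 1398 validation / 1110 test, 92065 samples total) amounts to a seeded random partition; a sketch, assuming the helper name and seed are illustrative rather than the paper's actual code:

```python
import random

def split_dataset(samples, n_val=1398, n_test=1110, seed=0):
    """Randomly partition samples into train/val/test splits with the
    sizes reported for Audio Condition. Seed and helper name are
    illustrative assumptions, not taken from the paper."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# 89557 + 1398 + 1110 = 92065 total samples
train, val, test = split_dataset(range(92065))
print(len(train), len(val), len(test))  # 89557 1398 1110
```

Fixing the seed makes the partition reproducible, which is the property the review's "Dataset Splits" row is checking for.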