Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

Authors: Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainable parameters.
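The quoted passage attributes the accuracy gain to applying rotary positional embeddings (RoPE) inside decoupled cross-attention. As an illustration only (not the authors' implementation; `rope_rotate` and the shapes are hypothetical), a minimal NumPy sketch of the RoPE rotation and where it would sit relative to the attention logits:

```python
import numpy as np

def rope_rotate(x, base=10000.0):
    """Apply rotary positional embedding to x of shape (seq_len, dim).

    Channel pairs are rotated by a position-dependent angle, so the dot
    product between a rotated query and key depends on their *relative*
    position -- the property that makes time-aligned conditioning work.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, as in the original RoFormer formulation.
    freqs = base ** (-np.arange(half) * 2.0 / dim)      # (half,)
    angles = np.outer(np.arange(seq_len), freqs)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

# In a decoupled cross-attention layer, RoPE would be applied to both the
# queries (from the audio latents) and the keys (from the condition
# sequence) before the attention logits are computed:
q = rope_rotate(np.random.randn(8, 64))
k = rope_rotate(np.random.randn(8, 64))
scores = q @ k.T  # relative-position-aware attention logits
```

Because each channel pair undergoes a pure rotation, the transform preserves vector norms and only re-phases the query/key interaction.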
Researcher Affiliation Academia ¹National Taiwan University, Taipei, Taiwan. ²Massachusetts Institute of Technology, Cambridge, MA, United States.
Pseudocode No The paper describes methods using mathematical equations and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Source code, model checkpoints, and demo examples are available at: https://MuseControlLite.github.io/web/.
Open Datasets Yes For training, we utilize the open-source MTG-Jamendo dataset (Bogdanov et al., 2019) and preprocess the data following this pipeline... For evaluation, we adopt the methodology outlined by Evans et al. (2024a;c;b). Specifically, we utilize the instrumental subset of the Song Describer dataset (Manco et al., 2023)
Dataset Splits Yes Additionally, we remove any samples that overlap with the Song Describer dataset (Manco et al., 2023), as this dataset is reserved exclusively for evaluation purposes. This yields an evaluation set comprising 586 audio clips. All 586 clips are used in the melody-conditioned generation experiments described in Section 5.1... To enable style transfer evaluation, we split the 586 audio clips into two disjoint subsets... For the audio inpainting and outpainting tasks in Section 5.3, we randomly selected a smaller subset of 50 clips from the original 586, without applying the style transfer setting.
Hardware Specification Yes The model is trained for 40,000 steps with an effective batch size of 128 on a single NVIDIA RTX 3090.
Software Dependencies No The paper mentions software like PyTorch's interpolate function, PANNs, and Qwen2-Audio-7B-Instruct but does not specify their version numbers.
Experiment Setup Yes We use a batch size of 128, a constant learning rate of 10⁻⁴, and a weight decay of 10⁻². To encourage the model to focus on cattr or caudio, we drop the text condition in 30% of training iterations. Additionally, each condition is independently dropped with a probability of 50% and subjected to random masking. The model is trained for 40,000 steps with an effective batch size of 128 on a single NVIDIA RTX 3090. For inference, we fix the separate guidance scales as shown in Table 2. We use 50 denoising steps to generate 47-second audio clips.
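The condition-dropout schedule quoted above (text dropped in 30% of iterations; each additional condition independently dropped with probability 50%) is the standard recipe for enabling classifier-free guidance over multiple conditions. A minimal sketch under those stated probabilities; the function name and condition names are illustrative, not from the paper:

```python
import random

def sample_condition_mask(extra_conditions, p_drop_text=0.3, p_drop_cond=0.5,
                          rng=random):
    """Decide which conditions are kept for one training iteration.

    Mirrors the schedule described in the quote: the text condition is
    dropped in 30% of iterations, and each attribute/audio condition is
    independently dropped with probability 50%, so the model learns both
    conditional and unconditional behavior for every input.
    """
    keep = {"text": rng.random() >= p_drop_text}
    for name in extra_conditions:
        keep[name] = rng.random() >= p_drop_cond
    return keep

# One sampled mask for a hypothetical set of conditions:
mask = sample_condition_mask(["melody", "rhythm", "dynamics", "audio"])
```

At inference time, the independently learned drop patterns are what allow separate guidance scales per condition, as the paper's Table 2 suggests.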