Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation

Authors: Yao Yao, Peike Li, Boyu Chen, Alex Wang

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted extensive experiments to evaluate the capabilities of JEN-1 Composer, focusing on its performance across various dimensions to understand its potential in real-world applications. ... Our evaluation employs both quantitative and qualitative metrics to assess the model's performance. ... We also use Fréchet Audio Distance (FAD) (Roblek et al. 2019) as a metric. ... Our ablation studies underscore the significance of each component within the JEN-1 Composer framework. As summarized in Table 2, we started with a baseline model featuring a four-track input-output configuration inspired by JEN-1 (Li et al. 2024), and incrementally introduced our proposed enhancements.
Researcher Affiliation Industry Yao Yao, Peike Li*, Boyu Chen, Alex Wang Jen Music AI EMAIL
Pseudocode Yes Algorithm 1: Human-AI Co-composition Workflow
1: Input: text prompt; user-provided tracks S (optional)
2: Output: set of selected and refined tracks S
3: e ← embedding of the given prompt
4: while S is empty do
5:   # Joint generation
6:   (x̂^1_0, ..., x̂^K_0) ← Model.GenerateTracks(e)
7:   S ← User.SelectAndRefineTracks(x̂^1_0, ..., x̂^K_0)
8: end while
9: while not all K tracks are satisfactory do
10:   # Using the CFG technique defined in Equation (6)
11:   (x̂^1_0, ..., x̂^K_0) ← Model.GenerateTracks(S, e)
12:   # Update S
13:   S ← S ∪ User.SelectAndRefineTracks(x̂^1_0, ..., x̂^K_0)
14: end while
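The two loops of Algorithm 1 can be sketched in runnable Python. `StubModel` and `StubUser` below are hypothetical stand-ins (the real system generates audio latents and involves an interactive human listener), and all method names are illustrative, not the paper's API:

```python
# Hedged sketch of Algorithm 1 (Human-AI Co-composition Workflow).
# StubModel and StubUser are hypothetical stand-ins for JEN-1 Composer
# and the human in the loop; track names are assumed for illustration.

class StubModel:
    """Placeholder for the generative model."""
    TRACKS = ["bass", "drums", "instrument", "melody"]  # assumed 4-track setup

    def embed(self, prompt):
        return hash(prompt)  # placeholder for the text-prompt embedding e

    def generate_tracks(self, e, condition=None):
        # Joint generation when condition is None; otherwise conditional
        # generation on already-accepted tracks (CFG, Equation (6)).
        return list(self.TRACKS)

class StubUser:
    """Placeholder listener who accepts one new track per round."""
    def __init__(self, wanted):
        self.wanted = set(wanted)
        self.kept = set()

    def select_and_refine_tracks(self, candidates):
        for t in candidates:
            if t in self.wanted and t not in self.kept:
                self.kept.add(t)
                return {t}
        return set()

    def satisfied(self, tracks):
        return self.wanted <= tracks

def co_compose(model, user, prompt, user_tracks=()):
    e = model.embed(prompt)       # Algorithm 1, line 3
    S = set(user_tracks)
    while not S:                  # lines 4-8: joint generation
        S = user.select_and_refine_tracks(model.generate_tracks(e))
    while not user.satisfied(S):  # lines 9-14: conditional refinement
        S |= user.select_and_refine_tracks(model.generate_tracks(e, condition=S))
    return S

result = co_compose(StubModel(), StubUser(StubModel.TRACKS), "upbeat jazz")
print(sorted(result))  # ['bass', 'drums', 'instrument', 'melody']
```

The stub user accepts one track per round, so the second loop runs once per remaining track before all K tracks are deemed satisfactory.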
Open Source Code No The paper mentions a demo link (https://www.jenmusic.ai/audio-demos) but does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets Yes To evaluate multi-track generation quality, we performed a zero-shot comparison on the Slakh2100 dataset (Manilow et al. 2019), without any fine-tuning.
Dataset Splits Yes The dataset is divided into 640 hours for training and 160 hours for testing.
Hardware Specification Yes Training was conducted on two NVIDIA A100 GPUs, with hyperparameters including the AdamW optimizer (Loshchilov and Hutter 2018), a linear decay learning rate starting at 3e-5, a batch size of 12 per GPU, and optimization settings of β1 = 0.9, β2 = 0.95, weight decay of 0.1, and a gradient clipping threshold of 0.7.
Software Dependencies Yes We utilize the 48k version of the pre-trained EnCodec (Défossez et al. 2022), resulting in a latent space representation of 150 frames per second, each with 128 dimensions. ... For text encoding, we employ the pre-trained Flan-T5-Large model (Chung et al. 2024), which provides robust capabilities for understanding and processing complex textual inputs.
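The quoted figures imply a simple latent-geometry calculation: at 150 latent frames per second with 128 dimensions per frame, an n-second clip maps to a (150·n) × 128 latent. A minimal sketch (constant and function names are illustrative, not from the paper):

```python
# Latent geometry implied by the quoted EnCodec 48k configuration.
FRAMES_PER_SECOND = 150   # latent frame rate reported in the excerpt
LATENT_DIM = 128          # dimensions per latent frame

def latent_shape(seconds):
    """Return (num_frames, dim) for a clip of the given duration."""
    return (int(seconds * FRAMES_PER_SECOND), LATENT_DIM)

print(latent_shape(10))   # (1500, 128)
```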
Experiment Setup Yes Training was conducted on two NVIDIA A100 GPUs, with hyperparameters including the AdamW optimizer (Loshchilov and Hutter 2018), a linear decay learning rate starting at 3e-5, a batch size of 12 per GPU, and optimization settings of β1 = 0.9, β2 = 0.95, weight decay of 0.1, and a gradient clipping threshold of 0.7. ... Sampler settings for conditional and marginal generation are optimized with p1 = 0.8. After 300 epochs, self-bootstrapping training is introduced with a probability p2 = 0.5. We determined the optimal value for the guidance scale parameter λ = 7 through a grid search.
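The quoted settings can be collected into a small sketch: the AdamW configuration, a linear-decay schedule starting at 3e-5, and the guidance scale λ = 7. The CFG combination below is the standard single-condition form; the paper's Equation (6) may be a multi-track generalization, and all names here are illustrative:

```python
# Hedged sketch of the reported training and sampling hyperparameters.
ADAMW = {"lr": 3e-5, "betas": (0.9, 0.95), "weight_decay": 0.1}
GRAD_CLIP = 0.7
BATCH_PER_GPU, NUM_GPUS = 12, 2
GUIDANCE_SCALE = 7.0  # lambda, found via grid search per the excerpt

def linear_decay_lr(step, total_steps, base_lr=3e-5):
    """Linearly decay the learning rate from base_lr to 0 over total_steps."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

def cfg_combine(cond_pred, uncond_pred, lam=GUIDANCE_SCALE):
    """Standard classifier-free guidance; Eq. (6) may be a multi-track variant."""
    return uncond_pred + lam * (cond_pred - uncond_pred)

print(BATCH_PER_GPU * NUM_GPUS)    # effective batch size: 24
print(linear_decay_lr(500, 1000))  # halfway through training: 1.5e-05
```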