Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation
Authors: Yao Yao, Peike Li, Boyu Chen, Alex Wang
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments to evaluate the capabilities of JEN-1 Composer, focusing on its performance across various dimensions to understand its potential in real-world applications. ... Our evaluation employs both quantitative and qualitative metrics to assess the model's performance. ... We also use Fréchet Audio Distance (FAD) (Roblek et al. 2019) as metric. ... Our ablation studies underscore the significance of each component within the JEN-1 Composer framework. As summarized in Table 2, we started with a baseline model featuring a four-track input-output configuration inspired by JEN-1 (Li et al. 2024), and incrementally introduced our proposed enhancements. |
| Researcher Affiliation | Industry | Yao Yao, Peike Li*, Boyu Chen, Alex Wang (Jen Music AI) |
| Pseudocode | Yes | Algorithm 1: Human-AI Co-composition Workflow. 1: Input: text prompt, user-provided tracks S (optional) 2: Output: set of selected and refined tracks S 3: e ← embedding of the given prompt 4: while S is empty do 5: # Joint generation 6: (x̂¹₀, ..., x̂ᴷ₀) ← Model.GenerateTracks(e) 7: S ← User.SelectAndRefineTracks(x̂¹₀, ..., x̂ᴷ₀) 8: end while 9: while not all K tracks are satisfactory do 10: # Using the CFG technique defined in Equation (6) 11: (x̂¹₀, ..., x̂ᴷ₀) ← Model.GenerateTracks(S, e) 12: # Update S 13: S ← S ∪ User.SelectAndRefineTracks(x̂¹₀, ..., x̂ᴷ₀) 14: end while |
| Open Source Code | No | The paper mentions a demo link (https://www.jenmusic.ai/audio-demos) but does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | To evaluate multi-track generation quality, we performed a zero-shot comparison on the Slakh2100 dataset (Manilow et al. 2019), without any fine-tuning. |
| Dataset Splits | Yes | The dataset is divided into 640 hours for training and 160 hours for testing. |
| Hardware Specification | Yes | Training was conducted on two NVIDIA A100 GPUs, with hyperparameters including the AdamW optimizer (Loshchilov and Hutter 2018), a linear decay learning rate starting at 3e-5, a batch size of 12 per GPU, and optimization settings of β1 = 0.9, β2 = 0.95, weight decay of 0.1, and a gradient clipping threshold of 0.7. |
| Software Dependencies | Yes | We utilize the 48k version of the pre-trained EnCodec (Défossez et al. 2022), resulting in a latent space representation of 150 frames per second, each with 128 dimensions. ... For text encoding, we employ the pre-trained Flan-T5-Large model (Chung et al. 2024), which provides robust capabilities for understanding and processing complex textual inputs. |
| Experiment Setup | Yes | Training was conducted on two NVIDIA A100 GPUs, with hyperparameters including the AdamW optimizer (Loshchilov and Hutter 2018), a linear decay learning rate starting at 3e-5, a batch size of 12 per GPU, and optimization settings of β1 = 0.9, β2 = 0.95, weight decay of 0.1, and a gradient clipping threshold of 0.7. ... Sampler settings for conditional and marginal generation are optimized with p1 = 0.8. After 300 epochs, self-bootstrapping training is introduced with a probability p2 = 0.5. We determined the optimal value for the guidance scale parameter λ = 7 through a grid search. |
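The Human-AI Co-composition Workflow quoted in the Pseudocode row (Algorithm 1) can be sketched as a plain Python loop. This is a hedged illustration, not the authors' implementation (no code is released): the `Model`/`User` interfaces and method names are hypothetical stand-ins for the paper's `Model.GenerateTracks` and `User.SelectAndRefineTracks` steps.

```python
def co_compose(model, user, prompt, num_tracks):
    """Sketch of Algorithm 1: iterate joint generation, then conditional
    regeneration (CFG, Eq. 6 in the paper) until all K tracks are accepted."""
    e = model.embed(prompt)      # embedding of the text prompt
    selected = {}                # S: accepted tracks, name -> audio

    # Phase 1: joint generation until the user keeps at least one track
    while not selected:
        candidates = model.generate_tracks(e)
        selected = dict(user.select_and_refine(candidates))

    # Phase 2: regenerate conditioned on accepted tracks until all K pass
    while len(selected) < num_tracks:
        candidates = model.generate_tracks(e, condition=selected)
        selected.update(user.select_and_refine(candidates))

    return selected


class _StubModel:
    """Toy stand-in for the generator; returns fixed placeholder audio."""
    def embed(self, prompt):
        return prompt

    def generate_tracks(self, e, condition=None):
        return {t: f"{t}-audio" for t in ("bass", "drums", "guitar", "piano")}


class _StubUser:
    """Toy stand-in for the human: accepts one more track per round."""
    def __init__(self):
        self.calls = 0

    def select_and_refine(self, candidates):
        self.calls += 1
        names = list(candidates)[: self.calls]
        return {n: candidates[n] for n in names}


tracks = co_compose(_StubModel(), _StubUser(), "lo-fi jazz", num_tracks=4)
```

With the stubs above, the loop terminates once all four tracks have been accepted, mirroring the while-conditions in Algorithm 1.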
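The training hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be collected into a single configuration sketch. Field names here are illustrative (the paper releases no code); the values are those reported in the excerpts above.

```python
# Training configuration reported for JEN-1 Composer, assembled from the
# quoted excerpts (key names are illustrative, values are from the paper).
TRAIN_CONFIG = {
    "gpus": "2x NVIDIA A100",
    "optimizer": "AdamW",
    "lr": 3e-5,                 # linear decay schedule
    "batch_size_per_gpu": 12,
    "betas": (0.9, 0.95),       # AdamW beta1, beta2
    "weight_decay": 0.1,
    "grad_clip": 0.7,           # gradient clipping threshold
    "p1_sampler": 0.8,          # conditional/marginal sampler setting
    "p2_self_bootstrap": 0.5,   # self-bootstrapping prob., after 300 epochs
    "cfg_guidance_scale": 7.0,  # lambda, chosen by grid search
}
```

A dictionary like this makes it easy to audit whether a reproduction run matches the reported settings.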