Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation

Authors: Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan

ICLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on three conditional generation tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. ... The quantitative experimental results are shown in Tab. 1 and Tab. 2."
Researcher Affiliation | Collaboration | Ye Zhu, Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA; Yu Wu, School of Computer Science, Wuhan University, Wuhan 430000, China; Kyle Olszewski, Jian Ren, Sergey Tulyakov, Snap Inc., Santa Monica, CA 90405, USA; Yan Yan, Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA
Pseudocode | Yes | "Algorithm 1: Conditional Discrete Contrastive Diffusion Training." The referenced equations can be found in the main paper.
Open Source Code | No | The paper states "For qualitative samples of synthesized dance music sequences, please refer to our anonymous page in the supplement with music samples," indicating an anonymous page provided for review, but there is no explicit statement about releasing the code for the methodology and no public repository link.
Open Datasets | Yes | "We use the AIST++ (Li et al., 2021) dataset and the TikTok Dance-Music dataset (Zhu et al., 2022a) for the dance-to-music experiments. ... We conduct text-to-image synthesis on CUB200 (Wah et al., 2011) and MSCOCO (Lin et al., 2014) datasets. ... We also perform the class-conditioned image generation on ImageNet (Deng et al., 2009; Russakovsky et al., 2015)."
Dataset Splits | Yes | "We adopt the official cross-modality splits without overlapping music songs for both datasets."
Hardware Specification | Yes | "For the dance2music task experiments on the AIST++ dataset, we use 4 NVIDIA RTX A5000 GPUs, and train the model for approximately 2 days. ... For the same experiments on the MSCOCO dataset, we run the experiments on Amazon Web Services (AWS) using 8 NVIDIA Tesla V100 GPUs. ... For the class-conditioned experiments on ImageNet, we use 8 NVIDIA Tesla V100 GPUs running on AWS."
Software Dependencies | No | The paper mentions using the "AdamW (Loshchilov & Hutter, 2017) optimizer" and implicitly relies on deep learning frameworks, but it does not specify version numbers for any programming languages, libraries, or other key software components.
Experiment Setup | Yes | "We set the initial weight for the contrastive loss as λ = 5e−5. The number N of intra- and inter-negative samples for each GT music sample is 10. ... The AdamW (Loshchilov & Hutter, 2017) optimizer with β1 = 0.9 and β2 = 0.96 is deployed in our training, with a learning rate of 4.5e−4. ... We adopt a truncation rate of 0.86 in our inference."
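
The reported hyperparameters can be collected into a single configuration for reference. This is a minimal sketch: the dict keys, the helper name `total_loss`, and the additive loss-combination form are illustrative assumptions; only the numeric values come from the paper's text, and the authors' actual code may organize things differently.

```python
# Hypothetical configuration sketch built from the hyperparameters quoted above.
# Only the numeric values are from the paper; all names here are illustrative.
CONFIG = {
    "contrastive_loss_weight": 5e-5,  # initial λ for the contrastive loss term
    "num_negatives": 10,              # N intra-/inter-negative samples per GT music sample
    "optimizer": "AdamW",             # Loshchilov & Hutter (2017)
    "betas": (0.9, 0.96),             # AdamW β1, β2
    "learning_rate": 4.5e-4,
    "truncation_rate": 0.86,          # applied at inference time
}

def total_loss(diffusion_loss: float, contrastive_loss: float,
               lam: float = CONFIG["contrastive_loss_weight"]) -> float:
    """Combine the diffusion objective with the weighted contrastive term.

    The additive form L = L_diff + λ * L_contrast is an assumption for
    illustration; the paper's text reports only the weight λ = 5e-5.
    """
    return diffusion_loss + lam * contrastive_loss
```
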