Any-to-Any Generation via Composable Diffusion

Authors: Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the any-to-any generation capability of CoDi, including single-to-single modality generation, multi-condition generation, and the novel capacity of joint generation of multiple modalities. For example, generating synchronized video and audio given the text input prompt; or generating video given a prompt image and audio. We also provide a quantitative evaluation of CoDi using eight multimodal datasets.
Researcher Affiliation | Collaboration | Zineng Tang (1), Ziyi Yang (2), Chenguang Zhu (2), Michael Zeng (2), Mohit Bansal (1); (1) University of North Carolina at Chapel Hill, (2) Microsoft Azure Cognitive Services Research
Pseudocode | No | The paper includes architectural diagrams (Figures 2, 6, and 7) and descriptions of methods, but no explicit pseudocode blocks or algorithms.
Open Source Code | Yes | The project page with demonstrations and code is at https://codi-gen.github.io/ [...] We will publicly release our code and checkpoints.
Open Datasets | Yes | We list training tasks of CoDi in Table 1, including single modality synthesis, joint multimodal generation, and contrastive learning to align prompt encoders. Table 1 provides an overview of the datasets, tasks, number of samples, and domain. [...] LAION-400M: Creative Commons CC-BY 4.0; AudioSet: Creative Commons CC-BY 4.0; AudioCaps: MIT; Freesound: Creative Commons; BBC Sound Effects: the BBC's Content Licence; SoundNet: MIT; WebVid-10M: WebVid; HD-Villa-100M: Research Use of Data Agreement v1.0
Dataset Splits | Yes | We test on the validation set of AudioCaps [24] since all four modalities are present in this dataset. [...] The benchmark is the validation set of AudioCaps [24].
Hardware Specification | No | The paper does not specify the hardware used for training or inference, such as specific GPU or CPU models.
Software Dependencies | No | The paper lists general software libraries like “PyTorch”, “Hugging Face Transformers”, “Torchvision”, and “Torchaudio” along with their licenses, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | Table 12: Hyperparameters for our diffusion models. [...] Diffusion Setup Diffusion steps 1000 [...] Learning rate 2e-5. [...] Section B Model Training [...] We use Adam [26] optimizer with learning rate 1e-4 and weight decay 1e-4. [...] We adopt curriculum learning on frame resolution and frames-per-second (FPS). First, the diffuser is trained on the WebVid dataset of a 256-frame resolution, with the training objective being text-conditioned video generation. The training clips are sampled from 2-second video chunks with 4 FPS. Second, the model is further trained on HD-Villa and ACAV datasets, with a 512-frame resolution and 8 FPS, and the training objective is image-conditioned video generation (the image is a randomly sampled frame of the clip). Each training clip contains 16 frames sampled from a 2-second video chunk with 8 FPS. (A hedged sketch of this setup follows the table.)
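The quoted experiment setup combines two sources: Table 12 (diffusion hyperparameters, e.g. 1000 diffusion steps and a 2e-5 learning rate) and Section B (Adam with learning rate 1e-4 and weight decay 1e-4, plus a two-stage video curriculum on resolution and FPS). The PyTorch-style sketch below is only a reading aid for how those quoted values could be wired together; the model interface, the `build_loader` helper, and the mapping of each learning rate to a training stage are assumptions, not the authors' released code.

```python
# Hedged sketch of the quoted training setup. `model.diffusion_loss` and
# `build_loader` are hypothetical placeholders, not CoDi's released API.
import torch

DIFFUSION_STEPS = 1000   # Table 12: "Diffusion steps 1000"
DIFFUSION_LR = 2e-5      # Table 12: "Learning rate 2e-5"

def make_optimizer(model: torch.nn.Module, lr: float = 1e-4) -> torch.optim.Adam:
    # Section B: Adam optimizer, learning rate 1e-4, weight decay 1e-4.
    # (Which stages use 1e-4 vs. Table 12's 2e-5 is an assumption here.)
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)

# Two-stage curriculum on frame resolution and FPS, as quoted from Section B.
CURRICULUM = [
    {   # Stage 1: text-conditioned video generation on WebVid,
        # clips sampled from 2-second chunks at 4 FPS.
        "datasets": ["WebVid"],
        "resolution": 256,
        "fps": 4,
        "clip_seconds": 2,
        "conditioning": "text",
    },
    {   # Stage 2: image-conditioned generation on HD-Villa and ACAV;
        # 16 frames per clip sampled from 2-second chunks at 8 FPS,
        # with the conditioning image being a randomly sampled frame.
        "datasets": ["HD-Villa", "ACAV"],
        "resolution": 512,
        "fps": 8,
        "clip_seconds": 2,
        "frames_per_clip": 16,
        "conditioning": "image",
    },
]

def train(model, build_loader):
    """Run the curriculum stages in order; `build_loader` is a hypothetical
    helper that turns one stage config into an iterable of batches."""
    optimizer = make_optimizer(model)
    for stage in CURRICULUM:
        for batch in build_loader(stage):
            loss = model.diffusion_loss(batch, num_steps=DIFFUSION_STEPS)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```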