Images that Sound: Composing Images and Sounds on a Single Canvas

Authors: Ziyang Chen, Daniel Geng, Andrew Owens

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our methods using quantitative metrics and human studies. We also present qualitative comparisons and an analysis of our method, and why it works.
Researcher Affiliation | Academia | Ziyang Chen, Daniel Geng, Andrew Owens (University of Michigan)
Pseudocode | Yes | Algorithm 1: Pseudocode in a PyTorch-like style for the imprint baseline. (A hedged sketch of one such imprint operation appears after this table.)
Open Source Code | Yes | We provide all code in our supplemental material, including code for baselines. We will release all code on acceptance.
Open Datasets | Yes | We randomly select 5 discrete (onset-based) and 5 continuous sound category names from VGGSound Common [17] as audio prompts. We randomly chose 5 object and 5 scene classes for image prompts, formatted as "a painting of [class], grayscale". For the image model, we use Stable Diffusion v1.5 [96]. For the audio model, we use Auffusion [118]. (Prompt construction is sketched after this table.)
Dataset Splits | No | The paper describes generating samples from prompts and evaluating them, but does not specify explicit training, validation, or test dataset splits for its own experimental process, as it leverages pre-trained models.
Hardware Specification | Yes | Our method is significantly faster, generating one sample in 10 seconds compared to the SDS baseline's 2-hour optimization time, using NVIDIA L40s.
Software Dependencies | No | The paper mentions specific models like Stable Diffusion v1.5 and DeepFloyd IF, and algorithms like HiFi-GAN and Griffin-Lim, but does not provide version numbers for general software dependencies (e.g., Python, PyTorch) or for the HiFi-GAN and Griffin-Lim implementations used.
Experiment Setup | Yes | We begin the reverse process with random latent noise z_T ∈ ℝ^(4×32×128), the same shape that Auffusion was trained on. ... We set the classifier guidance scales γ_v and γ_a to be between 7.5 and 10 and denoise the latents for 100 inference steps with warm-start parameters of t_a = 1.0, t_v = 0.9 to preserve audio priors. We decode the latent variables into images of dimension 3×256×1024. (A hedged sketch of this joint denoising loop appears after this table.)
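
The Pseudocode row says the paper gives Algorithm 1 for the imprint baseline in a PyTorch-like style, but the report does not reproduce it. Below is a minimal sketch of one plausible imprint operation, assuming the baseline attenuates a generated audio spectrogram with a normalized grayscale image; the function name and the exact masking rule are assumptions, not the paper's code.

```python
import torch

def imprint_baseline(spectrogram: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Imprint an image onto a spectrogram by elementwise scaling.

    Hypothetical reading of the imprint baseline: dark image pixels
    attenuate spectrogram magnitude, so the spectrogram visually
    resembles the image while retaining the audio's structure.

    spectrogram: (H, W) magnitude spectrogram from an audio model
    image:       (H, W) grayscale image, resized to match the spectrogram
    """
    # Normalize the image to [0, 1] so it acts as a per-bin gain mask.
    mask = (image - image.min()) / (image.max() - image.min() + 1e-8)
    # Bright pixels keep spectrogram energy; dark pixels suppress it.
    return spectrogram * mask
```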
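
For the prompt format quoted in the Open Datasets row, a small illustrative snippet; the class names here are hypothetical examples, not the paper's actual random picks.

```python
# Illustrative prompt construction following the quoted format.
image_classes = ["castle", "tiger", "kitchen"]      # stand-ins for the 5 object + 5 scene classes
image_prompts = [f"a painting of {c}, grayscale" for c in image_classes]
audio_prompts = ["dog barking", "waterfall"]        # stand-ins for VGGSound Common category names
```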
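
The Experiment Setup row describes denoising a single shared latent under guidance from both the image and audio models. Below is a sketch of such a joint reverse process in diffusers-style PyTorch, under the assumption that the method averages classifier-free-guided noise estimates from the two UNets on one latent and warm-starts the image branch at t_v = 0.9 while the audio branch runs from t_a = 1.0; `audio_unet`, `image_unet`, `scheduler`, and the prompt embeddings are hypothetical stand-ins for Auffusion and Stable Diffusion v1.5 components, not the released code.

```python
import torch

@torch.no_grad()
def joint_denoise(z, scheduler, audio_unet, image_unet,
                  audio_emb, image_emb, null_emb,
                  gamma_a=7.5, gamma_v=7.5, t_a=1.0, t_v=0.9):
    """Average guided noise estimates from two diffusion models that share
    one latent space. A sketch of one reading of the paper's setup; all
    arguments are assumed diffusers-style objects."""
    timesteps = scheduler.timesteps
    T = timesteps[0]
    for t in timesteps:
        frac = float(t) / float(T)  # normalized noise level in (0, 1]
        eps_list = []
        if frac <= t_a:  # audio branch active from the start (t_a = 1.0)
            e_u = audio_unet(z, t, encoder_hidden_states=null_emb).sample
            e_c = audio_unet(z, t, encoder_hidden_states=audio_emb).sample
            eps_list.append(e_u + gamma_a * (e_c - e_u))
        if frac <= t_v:  # image branch warm-starts once t falls below t_v
            e_u = image_unet(z, t, encoder_hidden_states=null_emb).sample
            e_c = image_unet(z, t, encoder_hidden_states=image_emb).sample
            eps_list.append(e_u + gamma_v * (e_c - e_u))
        eps = torch.stack(eps_list).mean(dim=0)  # average the active scores
        z = scheduler.step(eps, t, z).prev_sample
    return z
```

Starting from z of the quoted shape (1, 4, 32, 128), decoding the final latent with the image VAE would give the 3×256×1024 image view, while treating the same decoded canvas as a spectrogram and vocoding it (the paper mentions HiFi-GAN and Griffin-Lim) would give the audio view; the exact decoding path is an assumption here.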