Images that Sound: Composing Images and Sounds on a Single Canvas

Authors: Ziyang Chen, Daniel Geng, Andrew Owens

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our methods using quantitative metrics and human studies. We also present qualitative comparisons and an analysis of our method, and why it works.
Researcher Affiliation | Academia | Ziyang Chen, Daniel Geng, Andrew Owens (University of Michigan)
Pseudocode | Yes | Algorithm 1: Pseudocode in a PyTorch-like style for the imprint baseline. (A hedged sketch of one such imprint operation appears after this table.)
Open Source Code | Yes | We provide all code in our supplemental material, including code for baselines. We will release all code on acceptance.
Open Datasets | Yes | We randomly select 5 discrete (onset-based) and 5 continuous sound category names from VGGSound Common [17] as audio prompts. We randomly chose 5 object and 5 scene classes for image prompts, formatted as "a painting of [class], grayscale". For the image model, we use Stable Diffusion v1.5 [96]. For the audio model, we use Auffusion [118]. (Prompt construction is sketched after this table.)
Dataset Splits | No | The paper describes generating samples from prompts and evaluating them, but does not specify explicit training, validation, or test dataset splits for its own experimental process, as it leverages pre-trained models.
Hardware Specification | Yes | Our method is significantly faster, generating one sample in 10 seconds compared to the SDS baseline's 2-hour optimization time, using NVIDIA L40s.
Software Dependencies | No | The paper mentions specific models like Stable Diffusion v1.5 and DeepFloyd IF, and algorithms like HiFi-GAN and Griffin-Lim, but does not provide version numbers for general software dependencies (e.g., Python, PyTorch) or for the HiFi-GAN and Griffin-Lim implementations used.
Experiment Setup | Yes | We begin the reverse process with random latent noise z_T ∈ ℝ^(4×32×128), the same shape that Auffusion was trained on. ... We set the classifier guidance scales γ_v and γ_a to be between 7.5 and 10 and denoise the latents for 100 inference steps with warm-start parameters of t_a = 1.0, t_v = 0.9 to preserve audio priors. We decode the latent variables into images of dimension 3×256×1024. (A hedged sketch of this joint denoising loop appears after this table.)
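
The Pseudocode row says the paper gives Algorithm 1 for the imprint baseline in a PyTorch-like style, but the report does not reproduce it. Below is a minimal sketch of one plausible imprint operation, assuming the baseline attenuates a generated audio spectrogram with a normalized grayscale image; the function name and the exact masking rule are assumptions, not the paper's code.

```python
import torch

def imprint_baseline(spectrogram: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Imprint an image onto a spectrogram by elementwise scaling.

    Hypothetical reading of the imprint baseline: dark image pixels
    attenuate spectrogram magnitude, so the spectrogram visually
    resembles the image while retaining the audio's structure.

    spectrogram: (H, W) magnitude spectrogram from an audio model
    image:       (H, W) grayscale image, resized to match the spectrogram
    """
    # Normalize the image to [0, 1] so it acts as a per-bin gain mask.
    mask = (image - image.min()) / (image.max() - image.min() + 1e-8)
    # Bright pixels keep spectrogram energy; dark pixels suppress it.
    return spectrogram * mask
```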
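
For the prompt format quoted in the Open Datasets row, a small illustrative snippet; the class names here are hypothetical examples, not the paper's actual random picks.

```python
# Illustrative prompt construction following the quoted format.
image_classes = ["castle", "tiger", "kitchen"]      # stand-ins for the 5 object + 5 scene classes
image_prompts = [f"a painting of {c}, grayscale" for c in image_classes]
audio_prompts = ["dog barking", "waterfall"]        # stand-ins for VGGSound Common category names
```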
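
The Experiment Setup row describes denoising a single shared latent under guidance from both the image and audio models. Below is a sketch of such a joint reverse process in diffusers-style PyTorch, under the assumption that the method averages classifier-free-guided noise estimates from the two UNets on one latent and warm-starts the image branch at t_v = 0.9 while the audio branch runs from t_a = 1.0; `audio_unet`, `image_unet`, `scheduler`, and the prompt embeddings are hypothetical stand-ins for Auffusion and Stable Diffusion v1.5 components, not the released code.

```python
import torch

@torch.no_grad()
def joint_denoise(z, scheduler, audio_unet, image_unet,
                  audio_emb, image_emb, null_emb,
                  gamma_a=7.5, gamma_v=7.5, t_a=1.0, t_v=0.9):
    """Average guided noise estimates from two diffusion models that share
    one latent space. A sketch of one reading of the paper's setup; all
    arguments are assumed diffusers-style objects."""
    timesteps = scheduler.timesteps
    T = timesteps[0]
    for t in timesteps:
        frac = float(t) / float(T)  # normalized noise level in (0, 1]
        eps_list = []
        if frac <= t_a:  # audio branch active from the start (t_a = 1.0)
            e_u = audio_unet(z, t, encoder_hidden_states=null_emb).sample
            e_c = audio_unet(z, t, encoder_hidden_states=audio_emb).sample
            eps_list.append(e_u + gamma_a * (e_c - e_u))
        if frac <= t_v:  # image branch warm-starts once t falls below t_v
            e_u = image_unet(z, t, encoder_hidden_states=null_emb).sample
            e_c = image_unet(z, t, encoder_hidden_states=image_emb).sample
            eps_list.append(e_u + gamma_v * (e_c - e_u))
        eps = torch.stack(eps_list).mean(dim=0)  # average the active scores
        z = scheduler.step(eps, t, z).prev_sample
    return z
```

Starting from z of the quoted shape (1, 4, 32, 128), decoding the final latent with the image VAE would give the 3×256×1024 image view, while treating the same decoded canvas as a spectrogram and vocoding it (the paper mentions HiFi-GAN and Griffin-Lim) would give the audio view; the exact decoding path is an assumption here.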