Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
Authors: Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. We compare our method to state-of-the-art approaches, both in terms of objective evaluation and human study. We evaluate the audio-video alignment as well as video quality and diversity. To capture temporal alignment, we devise a new metric based on detecting energy peaks in both modalities separately and measuring their alignment. Further, we provide an ablation study where we consider alternative approaches to condition the video model. (An illustrative peak-alignment sketch follows the table.) |
| Researcher Affiliation | Collaboration | Guy Yariv1,2, Itai Gat3, Sagie Benaim1, Lior Wolf4, Idan Schwartz4,2, Yossi Adi1* 1The Hebrew University of Jerusalem, 2NetApp, 3Technion, 4Tel-Aviv University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/. Code and pretrained models will be publicly available upon acceptance. |
| Open Datasets | Yes | We consider the Landscape dataset (Lee et al. 2022), which captures landscape videos; the AudioSet-Drums dataset (Gemmeke et al. 2017), which captures drum videos; and the VGGSound dataset (Chen et al. 2020), which consists of a diverse set of real-world videos from 309 different semantic classes. |
| Dataset Splits | No | For the AudioSet-Drums dataset, the paper states: 'We used the same split as proposed by Ge et al. (2022), where 6k is used as the training set while the rest serves as a test set.' However, no explicit validation split percentage or count is provided for any dataset, and no split details are given for the VGGSound or Landscape datasets. |
| Hardware Specification | Yes | We optimized the model using two A6000 GPUs for 10K iterations. |
| Software Dependencies | No | The paper mentions using specific models such as BEATs and ModelScope, and metrics such as CLIP, but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | The proposed method contains 35M trainable parameters. We optimized the model using two A6000 GPUs for 10K iterations. We use the AdamW optimizer with a learning rate of 1e-05 and a constant learning rate scheduler. Each batch comprises 8 videos with 24 frames per video, sampled randomly at one-second granularity. To enhance training efficiency and mitigate memory consumption, we integrated gradient checkpointing into the training process of the 3D U-Net architecture. (A minimal optimizer/checkpointing sketch follows the table.) |
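
The AV-Align description quoted in the Research Type row (energy peaks detected in each modality, then matched in time) lends itself to a small illustration. The sketch below is not the paper's implementation: the librosa onset detector, the frame-difference motion proxy, the 0.1 s matching window, and the symmetric averaging are all assumptions made for illustration only.

```python
# Illustrative (NOT the paper's) peak-alignment score in the spirit of AV-Align.
# Assumptions: audio peaks come from librosa onset detection, video peaks from
# frame-difference motion energy, and the score is the symmetric fraction of
# peaks in one modality that fall within `window` seconds of a peak in the other.
import numpy as np
import librosa


def audio_peak_times(wav, sr):
    """Times (in seconds) of energy onsets in the audio track."""
    return np.asarray(librosa.onset.onset_detect(y=wav, sr=sr, units="time"))


def video_peak_times(frames, fps):
    """Times (in seconds) of local maxima in frame-difference motion energy."""
    frames = np.asarray(frames, dtype=np.float32)            # (T, H, W, C)
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    is_peak = (motion[1:-1] > motion[:-2]) & (motion[1:-1] > motion[2:])
    return (np.flatnonzero(is_peak) + 2) / fps               # +2: diff and crop offsets


def peak_alignment(audio_peaks, video_peaks, window=0.1):
    """Symmetric fraction of peaks matched across modalities within `window` seconds."""
    def hit_rate(src, ref):
        src, ref = np.asarray(src), np.asarray(ref)
        if len(src) == 0:
            return 0.0
        hits = [np.min(np.abs(ref - t)) <= window if len(ref) else False for t in src]
        return float(np.mean(hits))

    return 0.5 * (hit_rate(audio_peaks, video_peaks) + hit_rate(video_peaks, audio_peaks))
```

With real inputs, `peak_alignment(audio_peak_times(wav, sr), video_peak_times(frames, fps))` returns a score in [0, 1], where higher values mean audio and visual energy peaks co-occur more often; the paper's own AV-Align definition should be consulted for the exact formulation.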
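
The Experiment Setup row reports AdamW at a constant learning rate of 1e-05, 10K iterations, batches of 8 videos with 24 frames each, roughly 35M trainable parameters, and gradient checkpointing on the 3D U-Net. The PyTorch sketch below mirrors that optimization pattern only; the `AudioAdapter` module, the stand-in frozen backbone, and the random tensors are placeholders rather than the authors' code.

```python
# Minimal sketch (not the authors' code) of the reported optimization recipe:
# AdamW at lr 1e-05 with a constant schedule, 10K iterations, batches of
# 8 clips x 24 frames, a frozen backbone with a small trainable adapter, and
# gradient checkpointing to cut activation memory. All shapes are placeholders.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class AudioAdapter(nn.Module):
    """Placeholder mapper from clip-level audio features to conditioning tokens."""
    def __init__(self, audio_dim=768, token_dim=1024, n_tokens=24):
        super().__init__()
        self.proj = nn.Linear(audio_dim, token_dim * n_tokens)
        self.n_tokens, self.token_dim = n_tokens, token_dim

    def forward(self, audio_feats):                        # (B, audio_dim)
        return self.proj(audio_feats).view(-1, self.n_tokens, self.token_dim)


adapter = AudioAdapter()                                   # the only trainable part
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-5)  # constant LR, no schedule needed

backbone = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1))
backbone.requires_grad_(False)                             # stand-in for the frozen 3D U-Net

for step in range(10_000):                                 # 10K iterations
    audio_feats = torch.randn(8, 768)                      # batch of 8 clips
    tokens = adapter(audio_feats)                          # (8, 24, 1024): 24 frame tokens
    # Gradient checkpointing: recompute backbone activations during backward.
    out = checkpoint(backbone, tokens, use_reentrant=False)
    loss = out.pow(2).mean()                               # placeholder objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Freezing the backbone and training only a small adapter keeps the trainable parameter count low, while checkpointing trades recomputation for activation memory, consistent with the two-GPU budget reported above.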