Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
Authors: Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. We compare our method to state-of-the-art approaches, both in terms of objective evaluation and human study. We evaluate the audio-video alignment as well as video quality and diversity. To capture temporal alignment, we devise a new metric based on detecting energy peaks in both modalities separately and measuring their alignment. Further, we provide an ablation study where we consider alternative approaches to condition the video model. (An illustrative peak-alignment sketch follows the table.) |
| Researcher Affiliation | Collaboration | Guy Yariv1,2, Itai Gat3, Sagie Benaim1, Lior Wolf4, Idan Schwartz4,2, Yossi Adi1* 1The Hebrew University of Jerusalem, 2NetApp, 3Technion, 4Tel-Aviv University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/. Code and pretrained models will be publicly available upon acceptance. |
| Open Datasets | Yes | We consider the Landscape dataset (Lee et al. 2022), which captures landscape videos; the AudioSet-Drums dataset (Gemmeke et al. 2017), which captures drum videos; and the VGGSound dataset (Chen et al. 2020), which consists of a diverse set of real-world videos from 309 different semantic classes. |
| Dataset Splits | No | For the AudioSet-Drums dataset, the paper states: 'We used the same split as proposed by Ge et al. (2022), where 6k is used as the training set while the rest serves as a test set.' However, no explicit validation split percentage or count is provided for any dataset, and no split details are given for the VGGSound or Landscape datasets. |
| Hardware Specification | Yes | We optimized the model using two A6000 GPUs for 10K iterations. |
| Software Dependencies | No | The paper mentions using specific models such as BEATs and ModelScope, and metrics such as CLIP, but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | The proposed method contains 35M trainable parameters. We optimized the model using two A6000 GPUs for 10K iterations. We use the AdamW optimizer with a learning rate of 1e-05 and a constant learning rate scheduler. Each batch comprises 8 videos with 24 frames per video, sampled randomly at one-second granularity. To enhance training efficiency and mitigate memory consumption, we integrated gradient checkpointing into the training process of the 3D U-Net architecture. (A minimal optimizer/checkpointing sketch follows the table.) |
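
The AV-Align description quoted in the Research Type row (energy peaks detected in each modality, then matched in time) lends itself to a small illustration. The sketch below is not the paper's implementation: the librosa onset detector, the frame-difference motion proxy, the 0.1 s matching window, and the symmetric averaging are all assumptions made for illustration only.

```python
# Illustrative (NOT the paper's) peak-alignment score in the spirit of AV-Align.
# Assumptions: audio peaks come from librosa onset detection, video peaks from
# frame-difference motion energy, and the score is the symmetric fraction of
# peaks in one modality that fall within `window` seconds of a peak in the other.
import numpy as np
import librosa


def audio_peak_times(wav, sr):
    """Times (in seconds) of energy onsets in the audio track."""
    return np.asarray(librosa.onset.onset_detect(y=wav, sr=sr, units="time"))


def video_peak_times(frames, fps):
    """Times (in seconds) of local maxima in frame-difference motion energy."""
    frames = np.asarray(frames, dtype=np.float32)            # (T, H, W, C)
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    is_peak = (motion[1:-1] > motion[:-2]) & (motion[1:-1] > motion[2:])
    return (np.flatnonzero(is_peak) + 2) / fps               # +2: diff and crop offsets


def peak_alignment(audio_peaks, video_peaks, window=0.1):
    """Symmetric fraction of peaks matched across modalities within `window` seconds."""
    def hit_rate(src, ref):
        src, ref = np.asarray(src), np.asarray(ref)
        if len(src) == 0:
            return 0.0
        hits = [np.min(np.abs(ref - t)) <= window if len(ref) else False for t in src]
        return float(np.mean(hits))

    return 0.5 * (hit_rate(audio_peaks, video_peaks) + hit_rate(video_peaks, audio_peaks))
```

With real inputs, `peak_alignment(audio_peak_times(wav, sr), video_peak_times(frames, fps))` returns a score in [0, 1], where higher values mean audio and visual energy peaks co-occur more often; the paper's own AV-Align definition should be consulted for the exact formulation.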
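
The Experiment Setup row reports AdamW at a constant learning rate of 1e-05, 10K iterations, batches of 8 videos with 24 frames each, roughly 35M trainable parameters, and gradient checkpointing on the 3D U-Net. The PyTorch sketch below mirrors that optimization pattern only; the `AudioAdapter` module, the stand-in frozen backbone, and the random tensors are placeholders rather than the authors' code.

```python
# Minimal sketch (not the authors' code) of the reported optimization recipe:
# AdamW at lr 1e-05 with a constant schedule, 10K iterations, batches of
# 8 clips x 24 frames, a frozen backbone with a small trainable adapter, and
# gradient checkpointing to cut activation memory. All shapes are placeholders.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class AudioAdapter(nn.Module):
    """Placeholder mapper from clip-level audio features to conditioning tokens."""
    def __init__(self, audio_dim=768, token_dim=1024, n_tokens=24):
        super().__init__()
        self.proj = nn.Linear(audio_dim, token_dim * n_tokens)
        self.n_tokens, self.token_dim = n_tokens, token_dim

    def forward(self, audio_feats):                        # (B, audio_dim)
        return self.proj(audio_feats).view(-1, self.n_tokens, self.token_dim)


adapter = AudioAdapter()                                   # the only trainable part
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-5)  # constant LR, no schedule needed

backbone = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1))
backbone.requires_grad_(False)                             # stand-in for the frozen 3D U-Net

for step in range(10_000):                                 # 10K iterations
    audio_feats = torch.randn(8, 768)                      # batch of 8 clips
    tokens = adapter(audio_feats)                          # (8, 24, 1024): 24 frame tokens
    # Gradient checkpointing: recompute backbone activations during backward.
    out = checkpoint(backbone, tokens, use_reentrant=False)
    loss = out.pow(2).mean()                               # placeholder objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Freezing the backbone and training only a small adapter keeps the trainable parameter count low, while checkpointing trades recomputation for activation memory, consistent with the two-GPU budget reported above.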