Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du
IJCAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound. ... We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset... Table 1: Evaluation results for Video-to-Audio generation across three test sets... Table 3: We explore the effect of different cinematic language variations f during training... |
| Researcher Affiliation | Academia | Feizhen Huang, Yu Wu, Yutian Lin and Bo Du, School of Computer Science, Wuhan University EMAIL |
| Pseudocode | No | The paper describes methods and models using mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a specific link to source code, nor does it contain an explicit statement about releasing its code in supplementary materials or otherwise. It mentions building upon 'the open-source Diff-Foley [Luo et al., 2024]' but this refers to a third-party tool, not the authors' own implementation code. |
| Open Datasets | Yes | We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset containing over 200,000 video clips across 309 distinct sound categories. |
| Dataset Splits | Yes | We follow the original VGGSound train/test split. ... To evaluate performance under partial visibility, we create two modified test sets by applying cinematic language variations to the VGGSound [Chen et al., 2020a] test set. |
| Hardware Specification | Yes | The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a total batch size of 32. |
| Software Dependencies | No | The paper mentions using a pre-trained video encoder from CAVP [Luo et al., 2024] and building upon Diff-Foley [Luo et al., 2024], but it does not specify version numbers for these or any other software components (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | The input video clips are sampled at 4 frames per second (FPS)... For training, we only apply cinematic language variation fcu on VGGSound [Chen et al., 2020a] training set with k = 75%, where a1 = 0.4 and a2 = 0.6. The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a total batch size of 32. ... we use only CFG [Ho and Salimans, 2022] configuration in Diff-Foley, keeping all other experimental settings unchanged, including the DPM-Solver [Lu et al., 2022] Sampler with 25 inference steps and CFG scale = 4.5. |
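
The values quoted in the Experiment Setup row can be gathered into a single configuration object for anyone attempting a rerun. The following is a minimal sketch, not the authors' code: the class name, field names, and the `per_gpu_batch_size` helper are hypothetical, the learning rate reads the extracted "5 10 4" as 5 × 10⁻⁴, and the even split of the batch across GPUs is an assumption the paper does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters as quoted in the table above (field names are hypothetical)."""

    video_fps: int = 4                # "sampled at 4 frames per second"
    variation_ratio_k: float = 0.75   # "k = 75%"
    a1: float = 0.4                   # "a1 = 0.4"
    a2: float = 0.6                   # "a2 = 0.6"
    epochs: int = 25
    num_gpus: int = 4                 # NVIDIA 4090
    optimizer: str = "AdamW"
    learning_rate: float = 5e-4       # assumption: "5 10 4" read as 5 x 10^-4
    total_batch_size: int = 32
    sampler: str = "DPM-Solver"
    inference_steps: int = 25
    cfg_scale: float = 4.5

    def per_gpu_batch_size(self) -> int:
        # Assumption: the total batch is divided evenly across GPUs.
        if self.total_batch_size % self.num_gpus != 0:
            raise ValueError("total batch size is not divisible by the GPU count")
        return self.total_batch_size // self.num_gpus


setup = ReportedSetup()
print(setup.per_gpu_batch_size())
```

Freezing the dataclass keeps the reported values from being changed by accident during a reproduction attempt. Settings the paper leaves unspecified, such as library versions, are deliberately absent.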