Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing
Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative results demonstrate that the proposed method performs favorably against existing methods on weakly-supervised audio-visual video parsing. We evaluate the proposed method on the LLP [4] dataset. |
| Researcher Affiliation | Collaboration | National Yang Ming Chiao Tung University, UNC Chapel Hill, UC Merced, Snap Research, Google Research, Yonsei University |
| Pseudocode | No | The paper includes 'Figure 1: Algorithmic overview', which is a block diagram, but it contains no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are publicly available. |
| Open Datasets | Yes | We use the Look, Listen and Parse (LLP) Dataset [4] for all experiments. |
| Dataset Splits | Yes | We use the 10000 video clips with only video-level event annotations for model training. The detailed annotations (e.g., individual audio and visual events per second) are available for the remaining 1849 validation and test videos. (See the split-loading sketch after the table.) |
| Hardware Specification | Yes | We implement the proposed method using PyTorch [50], and conduct the training and evaluation processes on a single NVIDIA GTX 1080 Ti GPU with 11 GB memory. |
| Software Dependencies | No | The paper mentions 'PyTorch [50]' but does not specify its version number or list other software dependencies with versions. |
| Experiment Setup | No | The paper states 'Visual frames are sampled at 8 fps' and describes the feature-extraction process, but it does not provide training hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings. (A frame-sampling sketch follows the table.) |
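For reference, the split described in the Dataset Splits row (10000 weakly-labeled training clips; 1849 validation and test clips with per-second annotations) could be organized for loading as in the minimal sketch below. The file names (`train_weak.csv`, `val_test_full.csv`) and column layout are hypothetical, not taken from the paper or the official LLP release.

```python
import csv

# Hypothetical annotation files; names and columns are illustrative only.
TRAIN_CSV = "train_weak.csv"        # 10000 clips, video-level labels only
VAL_TEST_CSV = "val_test_full.csv"  # 1849 clips, per-second audio/visual labels

def load_weak_split(path):
    """Load weakly-labeled training clips: one multi-label tag set per video."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Assumed row format: a video id plus comma-separated event labels.
        return {row["video_id"]: row["labels"].split(",") for row in reader}

def load_dense_split(path):
    """Load densely-annotated clips: per-second audio and visual events."""
    clips = {}
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            clips.setdefault(row["video_id"], []).append(
                (int(row["second"]), row["modality"], row["event"])
            )
    return clips

train = load_weak_split(TRAIN_CSV)         # expected: 10000 entries
val_test = load_dense_split(VAL_TEST_CSV)  # expected: 1849 entries
```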
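The '8 fps' detail quoted in the Experiment Setup row admits a simple implementation. Below is a minimal sketch using OpenCV; only the target rate comes from the paper, while the function name, fallback frame rate, and example path are assumptions.

```python
import cv2

def sample_frames(video_path, target_fps=8):
    """Uniformly subsample frames to roughly target_fps, matching the
    paper's stated 'Visual frames are sampled at 8 fps' preprocessing."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))   # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

# Example usage (path is hypothetical):
# frames = sample_frames("llp_clip_0001.mp4")  # ~80 frames for a 10 s clip
```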