Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Quantitative and qualitative results demonstrate that the proposed method performs favorably against existing methods on weakly-supervised audio-visual video parsing. We evaluate the proposed method on the LLP [4] dataset.
Researcher Affiliation | Collaboration | National Yang Ming Chiao Tung University, UNC Chapel Hill, UC Merced, Snap Research, Google Research, Yonsei University
Pseudocode | No | The paper includes 'Figure 1: Algorithmic overview', which is a block diagram, but no structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are publicly available.
Open Datasets | Yes | We use the Look, Listen and Parse (LLP) Dataset [4] for all experiments.
Dataset Splits | Yes | We use the 10000 video clips with only video-level event annotations for model training. The detailed annotations (e.g., individual audio and visual events per second) are available for the remaining 1849 validation and test videos.
Hardware Specification | Yes | We implement the proposed method using PyTorch [50], and conduct the training and evaluation processes on a single NVIDIA GTX 1080 Ti GPU with 11 GB memory.
Software Dependencies | No | The paper mentions PyTorch [50] but does not specify a version number or other software dependencies with their versions.
Experiment Setup | No | The paper states 'Visual frames are sampled at 8 fps' and describes the feature extraction process, but it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings (a frame-sampling sketch follows the table).
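
The only preprocessing detail quoted above is the 8 fps visual sampling rate. Purely as an illustration of what that step might look like, below is a minimal sketch that decodes a clip with torchvision's video reader and keeps roughly 8 frames per second; the helper name `sample_frames_at_8fps` and the choice of torchvision are assumptions for this sketch, not the authors' released implementation.

```python
import torch
from torchvision.io import read_video


def sample_frames_at_8fps(video_path: str, target_fps: float = 8.0) -> torch.Tensor:
    """Decode a video and keep frames at approximately `target_fps`.

    Returns a (T, H, W, C) uint8 tensor of the retained frames.
    Hypothetical helper, written only to illustrate the 8 fps sampling
    mentioned in the paper's experiment setup.
    """
    # read_video returns (video frames, audio samples, metadata dict).
    frames, _, info = read_video(video_path, pts_unit="sec")
    native_fps = float(info.get("video_fps", target_fps))

    # Keep every (native_fps / target_fps)-th frame so the output
    # corresponds to roughly 8 frames per second of video.
    step = max(native_fps / target_fps, 1.0)
    keep = torch.arange(0, frames.shape[0], step).long()
    return frames[keep]
```

The sampled frames would then feed whatever pretrained feature extractors the paper describes; those settings, along with the optimizer, learning rate, batch size, and number of epochs, are what the report flags as unreported.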