Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing

Authors: Shentong Mo, Yapeng Tian

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the LLP [3] dataset validate that our new audio-visual video parsing framework achieves superior results over previous state-of-the-art methods [1, 2, 3, 4]. Empirical results also demonstrate the generalizability of our approach to contrastive learning and label refinement proposed in MA [4].
Researcher Affiliation | Academia | Shentong Mo (Carnegie Mellon University); Yapeng Tian (University of Texas at Dallas)
Pseudocode | No | The paper does not contain an explicitly labeled pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/stoneMo/MGN.
Open Datasets | Yes | The Look, Listen and Parse (LLP) dataset [3] contains 11,849 10-second YouTube video clips spanning 25 different event categories, such as car, music, cheering, and speech.
Dataset Splits | Yes | We use 10,000 video clips with only video-level event labels for training. Following the official validation and test splits [3], we develop and test the model on the remaining 1,849 videos, which carry segment-level annotations (e.g., a speech event in the audio starting at 1s and ending at 5s).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU model or CPU type).
Software Dependencies | No | The paper mentions software components such as ResNet-152, 3D ResNet, VGGish, and the Adam optimizer, but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | The model is trained with the Adam [41] optimizer with β1 = 0.9, β2 = 0.999 and an initial learning rate of 3e-4, using a batch size of 16 for 40 epochs. A hedged sketch of this configuration follows below.
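
As a concrete illustration of the reported setup, here is a minimal, hypothetical PyTorch sketch using only the hyperparameters stated above (Adam with β1 = 0.9, β2 = 0.999, initial learning rate 3e-4, batch size 16, 40 epochs). The feature tensors, classifier head, and loss are dummy stand-ins, not the authors' MGN implementation; the real code is in the linked repository.

```python
# Hypothetical training-loop sketch of the reported configuration.
# Only the optimizer/batch/epoch hyperparameters come from the paper;
# the data, model head, and loss below are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins: pooled 512-d audio/visual features and 25 event classes (LLP).
# The real training set consists of the 10,000 clips with video-level labels only.
audio_feats = torch.randn(160, 512)
visual_feats = torch.randn(160, 512)
weak_labels = torch.randint(0, 2, (160, 25)).float()  # video-level multi-label targets
train_set = TensorDataset(audio_feats, visual_feats, weak_labels)

# Placeholder classifier head standing in for the MGN model.
model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 25))
criterion = nn.BCEWithLogitsLoss()

# Reported setup: Adam, beta1=0.9, beta2=0.999, lr=3e-4, batch size 16, 40 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
loader = DataLoader(train_set, batch_size=16, shuffle=True)

for epoch in range(40):
    for audio, visual, labels in loader:
        optimizer.zero_grad()
        logits = model(torch.cat([audio, visual], dim=1))
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```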