Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
Authors: Shentong Mo, Yapeng Tian
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the LLP [3] dataset validate that our new audio-visual video parsing framework achieves superior results over previous state-of-the-art methods [1, 2, 3, 4]. Empirical results also demonstrate the generalizability of our approach to contrastive learning and label refinement proposed in MA [4]. |
| Researcher Affiliation | Academia | Shentong Mo (Carnegie Mellon University); Yapeng Tian (University of Texas at Dallas) |
| Pseudocode | No | The paper does not contain an explicitly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/stoneMo/MGN. |
| Open Datasets | Yes | The Look, Listen and Parse (LLP) Dataset [3] contains 11,849 10-second YouTube video clips from 25 different event categories, such as car, music, cheering, and speech. |
| Dataset Splits | Yes | We use 10,000 video clips with only video-level event labels for training. Following the official validation and test splits [3], we develop and test the model on the remaining 1,879 videos with segment-level annotations (e.g., a speech event in the audio track starting at 1s and ending at 5s). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU model, CPU type). |
| Software Dependencies | No | The paper mentions software components such as ResNet-152, 3D ResNet, VGGish, and the Adam optimizer, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The model is trained with Adam [41] optimizer with β1=0.9, β2=0.999 and with an initial learning rate of 3e-4. We train the model with a batch size of 16 for 40 epochs. |
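To make the reported experiment setup concrete, below is a minimal, runnable sketch of the training configuration, assuming a PyTorch implementation. The tiny stand-in network, the BCE loss, and the synthetic data are hypothetical placeholders; only the Adam settings (β1=0.9, β2=0.999, initial learning rate 3e-4), the batch size of 16, and the 40 training epochs come from the paper.

```python
# Sketch of the reported training hyperparameters (not the authors' actual code).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder network standing in for the multi-modal grouping model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 25))

# Adam optimizer with the settings reported in the paper.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,             # initial learning rate
    betas=(0.9, 0.999),  # beta_1 and beta_2
)
criterion = nn.BCEWithLogitsLoss()  # placeholder weakly-supervised loss

# Synthetic features/labels standing in for the 10,000 LLP training clips.
features = torch.randn(10_000, 128)
labels = torch.randint(0, 2, (10_000, 25)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

for epoch in range(40):  # 40 epochs as reported
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```

Swapping the placeholder model, loss, and data for the released implementation at https://github.com/stoneMo/MGN while keeping these optimizer, batch-size, and epoch settings should reproduce the training schedule described in the paper.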