Decompose the Sounds and Pixels, Recompose the Events

Authors: Varshanth R. Rao, Md Ibrahim Khalil, Haoda Li, Peng Dai, Juwei Lu

AAAI 2022, pp. 2144-2152 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the AVE dataset show that our collective framework outperforms the state-of-the-art by a sizable margin. Benchmarking against SoTAs: In Table 2, we compare our EDRNet framework to the Audiovisual Transformer (AVT) (Lin and Wang 2020) and Positive Sample Propagation (PSP) Network (Zhou et al. 2021) on the SEL and WSEL tasks on the O-AVE and C-AVE datasets.
Researcher Affiliation | Collaboration | Varshanth R. Rao¹, Md Ibrahim Khalil¹,², Haoda Li¹,³, Peng Dai¹, Juwei Lu¹ (¹Huawei Noah's Ark Lab; ²University of Waterloo, Canada; ³University of Toronto, Canada)
Pseudocode | Yes | Algorithm 1: SMB Intra-Class Video Fusion (a hedged, illustrative sketch of intra-class fusion follows the table)
Open Source Code | No | The paper does not include any statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | The AVE dataset (Tian et al. 2018) is a subset of Audio Set (Gemmeke et al. 2017) containing 4143 videos, each 10 seconds long, i.e., N = 10. There are 28 (+1 for BG) diverse event classes covering vehicle sounds, animal activity, instrument performances, etc. Video- and segment-level labels are available with clearly demarcated temporal boundaries.
Dataset Splits | Yes | We adopt the same train/validation/test split as (Tian et al. 2018).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions).
Experiment Setup | Yes | For a fair comparison with prior works, we utilize the same extracted audio and visual features (provided with the AVE dataset) using VGGish (Hershey et al. 2017) and VGG19 (Simonyan and Zisserman 2014) networks pretrained on Audio Set (Gemmeke et al. 2017) and ImageNet (Russakovsky et al. 2015) respectively. We configure the EDRNet with k = 3, L = 4, and a network width d_l = 768 for all layers. Sourcing the training set, we generate 250 samples per category using SMB video fusion. The optimization parameters for training EDRNet are specified in the Supplementary Material. Hyperparameter tuning was performed using the validation set.
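
Pulling together the dataset and experiment-setup rows above, the snippet below is a minimal sketch of the reported configuration, assuming the feature shapes that ship with the precomputed AVE features (128-d VGGish audio embeddings and 7x7x512 VGG19 feature maps per 1-second segment); the EDRNetConfig class and its field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class EDRNetConfig:
    """Hedged summary of the reported setup; all names are illustrative."""
    num_segments: int = 10                  # N = 10 one-second segments per video
    num_classes: int = 29                   # 28 event classes + background (BG)
    audio_feat_dim: int = 128               # VGGish embedding per segment (assumed shape)
    visual_feat_shape: tuple = (7, 7, 512)  # VGG19 map per segment (assumed shape)
    k: int = 3                              # reported k
    num_layers: int = 4                     # reported L
    layer_width: int = 768                  # reported d_l, identical for all layers
    fused_samples_per_class: int = 250      # samples generated per category via SMB fusion

cfg = EDRNetConfig()
```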
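
The paper's Algorithm 1 (SMB Intra-Class Video Fusion) is only named in the pseudocode row and not reproduced here. As a rough illustration of the general idea behind intra-class fusion, i.e., mixing segments from two videos of the same event class while carrying their segment labels along, here is a generic sketch over 2-D per-segment features; the fuse_intra_class function and its per-segment coin-flip policy are assumptions for illustration, not the paper's SMB procedure.

```python
import numpy as np

def fuse_intra_class(feats_a, labels_a, feats_b, labels_b, rng=None):
    """Generic intra-class segment splice (illustrative stand-in, not the
    paper's SMB algorithm): for each 1-second segment, take the features
    and segment label from either video A or video B of the same class.

    feats_*:  (N, D) per-segment features
    labels_*: (N, C) per-segment one-hot labels (C = 29 for AVE)
    """
    rng = rng or np.random.default_rng()
    n = feats_a.shape[0]                 # N segments (10 for AVE)
    take_b = rng.random(n) < 0.5         # per-segment source mask
    fused_feats = np.where(take_b[:, None], feats_b, feats_a)
    fused_labels = np.where(take_b[:, None], labels_b, labels_a)
    return fused_feats, fused_labels

# Toy usage: two same-class videos, 10 segments of 128-d audio features each.
a_feats, b_feats = np.random.rand(10, 128), np.random.rand(10, 128)
a_labels = np.eye(29)[np.random.randint(0, 29, size=10)]
b_labels = np.eye(29)[np.random.randint(0, 29, size=10)]
f_feats, f_labels = fuse_intra_class(a_feats, a_labels, b_feats, b_labels)
```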