Decompose the Sounds and Pixels, Recompose the Events
Authors: Varshanth R. Rao, Md Ibrahim Khalil, Haoda Li, Peng Dai, Juwei Lu
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the AVE dataset show that our collective framework outperforms the state-of-the-art by a sizable margin. Benchmarking against SoTAs: In Table 2, we compare our EDRNet framework to the Audiovisual Transformer (AVT) (Lin and Wang 2020) and Positive Sample Propagation (PSP) Network (Zhou et al. 2021) on the SEL and WSEL tasks on the O-AVE and C-AVE datasets. |
| Researcher Affiliation | Collaboration | Varshanth R. Rao1, Md Ibrahim Khalil1,2, Haoda Li1,3, Peng Dai1, Juwei Lu1 1Huawei Noah's Ark Lab 2University of Waterloo, Canada 3University of Toronto, Canada |
| Pseudocode | Yes | Algorithm 1: SMB Intra-Class Video Fusion (a hedged illustrative sketch follows the table) |
| Open Source Code | No | The paper does not include any statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | The AVE dataset (Tian et al. 2018) is a subset of the Audio Set (Gemmeke et al. 2017) containing 4143 videos, each 10 seconds long, i.e., N = 10. There are 28 (+1 for BG) diverse event classes covering vehicle sounds, animal activity, instrument performances, etc. Video and segment level labels are available with clearly demarcated temporal boundaries. |
| Dataset Splits | Yes | We adopt the same train/validation/test split as (Tian et al. 2018). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | For a fair comparison with prior works, we utilize the same extracted audio and visual features (provided with the AVE dataset) using VGGish (Hershey et al. 2017) and VGG19 (Simonyan and Zisserman 2014) networks pretrained on Audio Set (Gemmeke et al. 2017) and ImageNet (Russakovsky et al. 2015) respectively. We configure the EDRNet with k = 3, L = 4, and a network width dl = 768 for all layers. Sourcing the training set, we generate 250 samples per category using SMB video fusion. The optimization parameters for training EDRNet are specified in the Supplementary Material. Hyperparameter tuning was performed using the validation set. (A hedged configuration sketch follows the table.) |
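
The Experiment Setup row lists the key hyperparameters reported in the paper (k = 3, L = 4, dl = 768, 250 fused samples per category, N = 10 segments, 28 + 1 classes). Since no code is released, the following is only a minimal configuration sketch: the class name `EDRNetConfig`, its field names, and the audio/visual feature dimensions are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

# Illustrative only: field names and the audio/visual feature dimensions are
# assumptions; the hyperparameter values are those quoted in the table above.
@dataclass
class EDRNetConfig:
    num_segments: int = 10        # N = 10 one-second segments per AVE video
    num_classes: int = 29         # 28 event classes + 1 background (BG)
    audio_feat_dim: int = 128     # VGGish features shipped with the AVE dataset (assumed dim)
    visual_feat_dim: int = 512    # VGG19 feature channels shipped with the AVE dataset (assumed dim)
    k: int = 3                    # k = 3 as stated in the paper
    num_layers: int = 4           # L = 4 as stated in the paper
    width: int = 768              # d_l = 768 for all layers
    fused_samples_per_class: int = 250  # samples generated per category via SMB video fusion

cfg = EDRNetConfig()
print(cfg)
```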
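
The Pseudocode row refers to Algorithm 1 (SMB Intra-Class Video Fusion), whose exact rules are given only in the paper and are not reproduced here. The sketch below is a guess at what fusing two same-class videos could look like (splicing segment-level features at a random temporal cut point); the function name, signature, and all details are assumptions.

```python
import numpy as np

def fuse_same_class_videos(feats_a, feats_b, labels_a, labels_b, rng=None):
    """Splice two same-class videos at a random temporal cut point (illustrative guess,
    not the paper's Algorithm 1).

    feats_*:  (N, D) segment-level features of two videos sharing the same event class.
    labels_*: (N,) per-segment labels (event class or BG).
    Returns fused (N, D) features and (N,) labels.
    """
    rng = rng or np.random.default_rng()
    n = feats_a.shape[0]
    cut = int(rng.integers(1, n))  # cut strictly inside the video (1..N-1)
    fused_feats = np.concatenate([feats_a[:cut], feats_b[cut:]], axis=0)
    fused_labels = np.concatenate([labels_a[:cut], labels_b[cut:]], axis=0)
    return fused_feats, fused_labels
```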