Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization
Authors: Jun-Tae Lee, Mihir Jain, Hyoungwoo Park, Sungrack Yun
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two publicly available video datasets (AVE and ActivityNet 1.2) show that the proposed method effectively fuses audio and visual modalities and achieves state-of-the-art results for weakly-supervised action localization. |
| Researcher Affiliation | Industry | Juntae Lee, Mihir Jain, Hyoungwoo Park & Sungrack Yun, Qualcomm AI Research. {juntlee,mijain,hwoopark,sungrack}@qti.qualcomm.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include a statement about releasing its source code or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | Extensive experiments are conducted on two video datasets for localizing audio-visual events (AVE [1]) and actions (ActivityNet 1.2 [2]). [1] https://github.com/YapengTian/AVE-ECCV18 [2] http://activity-net.org/download.html |
| Dataset Splits | Yes | ActivityNet 1.2 is a temporal action localization dataset with 4,819 training and 2,383 validation videos; following the literature, the validation set is used for evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions using specific networks (I3D, ResNet152, and a VGG-like network) for feature extraction, but it does not specify versions for software libraries or dependencies (e.g., Python, PyTorch, or TensorFlow). |
| Experiment Setup | Yes | We set dx to 1,024, and the Leaky ReLU and hyperbolic tangent functions are respectively used for the activation of the modality-specific layers and cross-attention modules. In training, the parameters are initialized with the Xavier method (Glorot & Bengio, 2010) and updated by the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10⁻⁴ and a batch size of 30. Also, dropout with a ratio of 0.7 is applied to the final attended audio-visual features. In the loss, the hyperparameters are set as B = 4, α = 0.8, β = 0.8 and γ = 1. |
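
The quoted setup translates into a short training-configuration sketch. The PyTorch snippet below is a minimal illustration, not the authors' implementation: the raw audio/visual feature dimensions (128 and 2,048), the bilinear cross-attention scoring, and all module names are assumptions made here for concreteness. Only dx = 1,024, the Leaky ReLU/tanh activations, Xavier initialization, Adam with learning rate 10⁻⁴, batch size 30, and the 0.7 dropout are taken from the paper's stated setup.

```python
import torch
import torch.nn as nn

D_X = 1024  # feature dimension d_x reported in the paper


class ModalitySpecificLayer(nn.Module):
    """Per-modality projection with Leaky ReLU activation, per the quoted setup."""
    def __init__(self, in_dim, out_dim=D_X):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return self.act(self.fc(x))


class CrossAttention(nn.Module):
    """Hypothetical cross-attention block: one modality attends to the other.
    The tanh activation follows the paper's stated choice; the bilinear
    scoring used here is an assumption, not the paper's equation."""
    def __init__(self, dim=D_X):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)

    def forward(self, query, context):
        # query: (B, T, D), context: (B, T, D)
        scores = torch.tanh(self.w(query)) @ context.transpose(1, 2)  # (B, T, T)
        attn = scores.softmax(dim=-1)
        return attn @ context  # attended features, (B, T, D)


def xavier_init(m):
    # Xavier initialization (Glorot & Bengio, 2010), as reported
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)


# assumed input dims: 128-d VGG-like audio features, 2,048-d ResNet152/I3D visual features
audio_net = ModalitySpecificLayer(in_dim=128)
visual_net = ModalitySpecificLayer(in_dim=2048)
cross_av = CrossAttention()
dropout = nn.Dropout(p=0.7)  # applied to the final attended audio-visual features

model = nn.ModuleList([audio_net, visual_net, cross_av])
model.apply(xavier_init)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, learning rate 10⁻⁴

# toy forward pass: T = 10 temporal segments, batch size 30 as reported
a = audio_net(torch.randn(30, 10, 128))
v = visual_net(torch.randn(30, 10, 2048))
fused = dropout(cross_av(v, a))  # visual attends to audio (one direction shown)
# loss hyperparameters B = 4, α = 0.8, β = 0.8, γ = 1 belong to the paper's loss,
# whose definition is not quoted here, so the loss itself is omitted.
```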