MARGINALIZED AVERAGE ATTENTIONAL NETWORK FOR WEAKLY-SUPERVISED LEARNING

Authors: Yuan Yuan, Yueming Lyu, Xi Shen, Ivor W. Tsang, Dit-Yan Yeung

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two large-scale video datasets show that our MAAN achieves a superior performance on weakly-supervised temporal action localization.
Researcher Affiliation | Collaboration | Yuan Yuan (Hong Kong University of Science and Technology; Alibaba Group), Yueming Lyu (University of Technology Sydney), Xi Shen (École des Ponts ParisTech), Ivor W. Tsang (University of Technology Sydney), and Dit-Yan Yeung (Hong Kong University of Science and Technology).
Pseudocode | Yes | Algorithm 1: Marginalized Average Aggregation
Input: feature representations {x_1, x_2, ..., x_T}, sampling probabilities {p_1, p_2, ..., p_T}.
Output: aggregated representation x̄.
Initialize m_0^0 = 0, q_0^0 = 1, b_i = i/(i+1);
for t = 1 to T do
    Set m_0^t = 0, q_{-1}^t = 0 and q_{t+1}^t = 0;
    for i = 1 to t do
        q_i^t = p_t q_{i-1}^{t-1} + (1 - p_t) q_i^{t-1}
        m_i^t = p_t (b_{i-1} m_{i-1}^{t-1} + (1 - b_{i-1}) q_{i-1}^{t-1} x_t) + (1 - p_t) m_i^{t-1}
    end for
end for
Return x̄ = Σ_{i=1}^{T} m_i^T
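For reference, the following is a minimal Python/PyTorch sketch of the recursion in Algorithm 1, written directly from the pseudocode above. The function and variable names are ours, and since the authors' implementation is not released, this sketch may differ from theirs in detail.

import torch

def marginalized_average_aggregation(x, p):
    # x: (T, D) snippet features; p: (T,) latent discriminative probabilities p_t.
    T, D = x.shape
    b = [i / (i + 1) for i in range(T + 1)]          # b_i = i / (i + 1)
    q = torch.zeros(T + 1, dtype=x.dtype)            # holds q_i^{t-1}; q_0^0 = 1
    m = torch.zeros(T + 1, D, dtype=x.dtype)         # holds m_i^{t-1}; m_0^0 = 0
    q[0] = 1.0
    for t in range(1, T + 1):
        q_new = torch.zeros_like(q)                  # m_0^t = 0 and out-of-range q's stay 0
        m_new = torch.zeros_like(m)
        for i in range(1, t + 1):
            # q_i^t = p_t q_{i-1}^{t-1} + (1 - p_t) q_i^{t-1}
            q_new[i] = p[t - 1] * q[i - 1] + (1 - p[t - 1]) * q[i]
            # m_i^t = p_t (b_{i-1} m_{i-1}^{t-1} + (1 - b_{i-1}) q_{i-1}^{t-1} x_t) + (1 - p_t) m_i^{t-1}
            m_new[i] = p[t - 1] * (b[i - 1] * m[i - 1] + (1 - b[i - 1]) * q[i - 1] * x[t - 1]) \
                       + (1 - p[t - 1]) * m[i]
        q, m = q_new, m_new
    return m[1:].sum(dim=0)                          # x̄ = sum_{i=1}^{T} m_i^T

# Example: aggregate T = 20 snippet features of dimension 1024.
x_bar = marginalized_average_aggregation(torch.randn(20, 1024), torch.rand(20))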
Open Source Code | No | The paper states "Our algorithm is implemented in PyTorch" with a footnote link to https://github.com/pytorch/pytorch. The provided link points to the PyTorch framework itself, not to the authors' specific implementation code.
Open Datasets | Yes | We evaluate MAAN on two popular action localization benchmark datasets, THUMOS14 (Jiang et al., 2014) and ActivityNet 1.3 (Heilbron et al., 2015).
Dataset Splits | Yes | THUMOS14 contains 200 untrimmed videos (3,027 action instances) in the validation set and 212 untrimmed videos (3,358 action instances) in the test set. Following standard practice, we train the models on the validation set without using the temporal annotations and evaluate them on the test set. ActivityNet 1.3 is a large-scale video benchmark for action detection which covers a wide range of complex human activities. It provides samples from 200 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours. This dataset contains 10,024 training videos, 4,926 validation videos and 5,044 test videos.
Hardware Specification | Yes | We run all the experiments on a single NVIDIA Tesla M40 GPU with 24 GB of memory.
Software Dependencies | No | The paper only states "Our algorithm is implemented in PyTorch"; the "2" in the extracted text appears to be a footnote marker pointing to the PyTorch repository rather than a version number, so no specific PyTorch version or other software dependencies are reported.
Experiment Setup | Yes | We set T to 20 in our MAAN model. The attention module in Figure 3 consists of an FC layer of 1024 → 256, a LeakyReLU layer, an FC layer of 256 → 1, and a sigmoid non-linear activation, to generate the latent discriminative probability p_t. We pass the aggregated video-level representation through an FC layer of 1024 → C followed by a sigmoid activation to obtain class scores. We use the ADAM optimizer (Kingma & Ba, 2014) with an initial learning rate of 5 × 10^-4 to optimize the network parameters. At test time, we first reject classes whose video-level probabilities are below 0.1. We then forward all the snippets of the video to generate the CAS for the remaining classes. We generate the temporal proposals by cutting the CAS with a threshold th. The combination ratio of the two-stream modalities is set to 0.5 and 0.5. We test all the models with the cutting threshold th set to 0.2 of the max value of the CAS. To generate the proposals from the one-dimensional CAS, we use a set of thresholds, [0.2, 0.15, 0.1, 0.05] of the max value of the CAS.
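A hedged PyTorch sketch of the modules and optimizer settings described above (attention module FC 1024 → 256, LeakyReLU, FC 256 → 1, sigmoid; classification FC 1024 → C with sigmoid; ADAM with initial learning rate 5 × 10^-4). The class names and the number of classes are our choices for illustration, not the authors' code.

import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    # Produces the latent discriminative probability p_t for each snippet feature.
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (T, 1024) snippet features
        return self.net(x).squeeze(-1)         # (T,) probabilities p_1, ..., p_T

class VideoClassifier(nn.Module):
    # Maps the aggregated video-level representation to per-class probabilities.
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x_bar):                  # x_bar: (1024,) aggregated representation
        return torch.sigmoid(self.fc(x_bar))   # (C,) video-level class scores

attention, classifier = AttentionModule(), VideoClassifier(num_classes=20)
optimizer = torch.optim.Adam(
    list(attention.parameters()) + list(classifier.parameters()),
    lr=5e-4,                                   # initial learning rate 5 * 10^-4
)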
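The test-time procedure (rejecting classes with video-level probability below 0.1, then cutting the CAS at a fraction th of its maximum) can be sketched as follows; the grouping of consecutive above-threshold snippets into segments is our assumption about how the cut CAS is turned into temporal proposals.

import numpy as np

def generate_proposals(cas, video_probs, class_thresh=0.1, th=0.2):
    # cas: (T, C) class activation sequence; video_probs: (C,) video-level class scores.
    proposals = []                                       # (class, start_snippet, end_snippet)
    for c in np.where(video_probs >= class_thresh)[0]:   # reject classes below 0.1
        scores = cas[:, c]
        keep = scores >= th * scores.max()               # cut the CAS at th * max value
        start = None
        for t, k in enumerate(keep):                     # group consecutive kept snippets
            if k and start is None:
                start = t
            elif not k and start is not None:
                proposals.append((c, start, t))
                start = None
        if start is not None:
            proposals.append((c, start, len(keep)))
    return proposals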