Multimodal Keyless Attention Fusion for Video Classification

Authors: Xiang Long, Chuang Gan, Gerard de Melo, Xiao Liu, Yandong Li, Fu Li, Shilei Wen

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on four highly heterogeneous datasets, UCF101, ActivityNet, Kinetics, and YouTube-8M to validate our conclusion, and show that our approach achieves highly competitive results.
Researcher Affiliation | Collaboration | Xiang Long (1), Chuang Gan (1), Gerard de Melo (2), Xiao Liu (3), Yandong Li (3), Fu Li (3), Shilei Wen (3); (1) Tsinghua University, (2) Rutgers University, (3) Baidu IDL
Pseudocode | No | The paper describes mathematical equations for its models (e.g., LSTM equations 4-9), but it does not contain a structured pseudocode block or a clearly labeled algorithm figure. (A hedged sketch of the keyless attention mechanism appears after the table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We evaluate our approach on four popular video classification datasets. UCF101 (Soomro, Roshan Zamir, and Shah 2012) [...] ActivityNet (Heilbron et al. 2015) [...] Kinetics (Carreira and Zisserman 2017) [...] YouTube-8M (Abu-El-Haija et al. 2016)
Dataset Splits | Yes | UCF101: Following the original evaluation scheme, we report the average accuracy over three training/testing splits. ActivityNet: In the official split, the distribution among training, validation, and test data is about 50%, 25%, and 25% of the total videos, respectively. Kinetics: The dataset contains 246,535 training videos, 19,907 validation videos, and 38,685 test videos, covering 400 human action classes. YouTube-8M: In the official split, the distribution among training, validation, and test data is about 70%, 20%, and 10%, respectively.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions models and algorithms (e.g., 'ResNet-152', 'RMSPROP algorithm') but does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch versions).
Experiment Setup | Yes | For UCF101 and ActivityNet, we extract both RGB and flow features using a ResNet-152 (He et al. 2016) model. For Kinetics, we extract RGB and flow features using Inception-ResNet-v2 (Szegedy et al. 2016) and extract audio features with a VGG-16 (Simonyan and Zisserman 2014a). The number of segments we used for fine-tuning is 3 for UCF101, and 7 for ActivityNet and Kinetics... We max-pool the frame-level features to 5 segment-level features for UCF101 and Kinetics... and 20 for ActivityNet... For YouTube-8M... the maximum number of segments is 300. The number of hidden units for the LSTM on UCF101, ActivityNet, and Kinetics is 512, while for YouTube-8M, we use 1024... with a learning rate of 0.0001. (Hedged sketches of the segment pooling and optimizer setup follow the table.)
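
As the Pseudocode row notes, the paper specifies its model through equations rather than pseudocode. The following is a minimal sketch of a keyless (query-free) attention pooling layer of the kind the title refers to, written in PyTorch: each timestep of an LSTM output scores itself and the scores are softmax-normalized over time. The class name, scoring network, and feature dimensions are illustrative assumptions, not the paper's exact equations.

```python
import torch
import torch.nn as nn

class KeylessAttentionPooling(nn.Module):
    """Attention pooling over a sequence without an external query ("keyless"):
    each timestep scores itself, and the scores are normalized over time.
    Illustrative sketch only; not the paper's exact formulation."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1, bias=False),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden), e.g. the output sequence of an LSTM
        e = self.score(h)                 # (batch, time, 1) unnormalized scores
        a = torch.softmax(e, dim=1)       # attention weights over time
        return (a * h).sum(dim=1)         # (batch, hidden) pooled feature


# Usage with the hidden size reported for UCF101/ActivityNet/Kinetics (512);
# the 2048-d input and 5 segments per video are assumptions for illustration.
lstm = nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)
pool = KeylessAttentionPooling(512)
x = torch.randn(8, 5, 2048)               # batch of 8 videos, 5 segments each
outputs, _ = lstm(x)
video_feature = pool(outputs)             # (8, 512)
```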
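The Experiment Setup row quotes concrete numbers (segment-level max pooling to 5 or 20 segments, 512/1024 hidden units, learning rate 0.0001), and the Software Dependencies row mentions the RMSPROP algorithm. The sketch below shows how the segment pooling and optimizer configuration might look; the chunking scheme, the 2048-d feature size, and the placeholder model are assumptions, not details taken from the paper.

```python
import torch

def max_pool_segments(frames: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Max-pool frame-level features of shape (time, feat_dim) into
    num_segments segment-level features. Assumes time >= num_segments;
    the exact chunking scheme is a guess, not taken from the paper."""
    chunks = torch.chunk(frames, num_segments, dim=0)
    return torch.stack([c.max(dim=0).values for c in chunks], dim=0)

# Example: 120 frame features (2048-d, an assumed dimensionality) pooled to
# the 5 segments reported for UCF101 and Kinetics.
frames = torch.randn(120, 2048)
segments = max_pool_segments(frames, 5)    # -> shape (5, 2048)

# Optimizer matching the quoted hyperparameters: RMSProp, learning rate 1e-4.
# `model` is a placeholder single branch, not the full multimodal network.
model = torch.nn.LSTM(2048, 512, batch_first=True)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
```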