Multimodal Keyless Attention Fusion for Video Classification
Authors: Xiang Long, Chuang Gan, Gerard de Melo, Xiao Liu, Yandong Li, Fu Li, Shilei Wen
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on four highly heterogeneous datasets, UCF101, ActivityNet, Kinetics, and YouTube-8M to validate our conclusion, and show that our approach achieves highly competitive results. |
| Researcher Affiliation | Collaboration | Xiang Long (1), Chuang Gan (1), Gerard de Melo (2), Xiao Liu (3), Yandong Li (3), Fu Li (3), Shilei Wen (3); (1) Tsinghua University, (2) Rutgers University, (3) Baidu IDL |
| Pseudocode | No | The paper describes mathematical equations for its models (e.g., LSTM equations 4-9), but it does not contain a structured pseudocode block or a clearly labeled algorithm figure. (A hedged sketch of the described attention model appears after this table.) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We evaluate our approach on four popular video classification datasets. UCF101 (Soomro, Roshan Zamir, and Shah 2012) [...] ActivityNet (Heilbron et al. 2015) [...] Kinetics (Carreira and Zisserman 2017) [...] YouTube-8M (Abu-El-Haija et al. 2016) |
| Dataset Splits | Yes | UCF101: Following the original evaluation scheme, we report the average accuracy over three training/testing splits. ActivityNet: In the official split, the distribution among training, validation, and test data is about 50%, 25%, and 25% of the total videos, respectively. Kinetics: The dataset contains 246,535 training videos, 19,907 validation videos, and 38,685 test videos, covering 400 human action classes. YouTube-8M: In the official split, the distribution among training, validation, and test data is about 70%, 20%, and 10%, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions models and algorithms (e.g., 'ResNet-152', 'RMSProp algorithm') but does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch versions). |
| Experiment Setup | Yes | For UCF101 and ActivityNet, we extract both RGB and flow features using a ResNet-152 (He et al. 2016) model. For Kinetics, we extract RGB and flow features using Inception-ResNet-v2 (Szegedy et al. 2016) and extract audio features with a VGG-16 (Simonyan and Zisserman 2014a). The number of segments we used for fine-tuning is 3 for UCF101, and 7 for ActivityNet and Kinetics... We max-pool the frame-level features to 5 segment-level features for UCF101 and Kinetics... and 20 for ActivityNet... For YouTube-8M... the maximum number of segments is 300. The number of hidden units for the LSTM on UCF101, ActivityNet, and Kinetics is 512, while for YouTube-8M, we use 1024... with a learning rate of 0.0001. (These quoted values are collected into a configuration sketch after the table.) |
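
Since the paper provides equations but no pseudocode, the following is a minimal PyTorch sketch of the kind of per-modality LSTM plus "keyless" attention fusion the paper describes. The score function (a linear-tanh-linear scorer over each LSTM output, with no external query vector), the class and parameter names, and the concatenate-then-classify fusion are illustrative assumptions; only the 512-unit LSTM size is taken from the setup row above, and the sketch should be checked against the paper's Equations 4-9.

```python
# Hedged sketch of keyless attention fusion over per-modality LSTM outputs.
# The score function and fusion-by-concatenation are assumptions, not the
# paper's verified implementation.
import torch
import torch.nn as nn


class KeylessAttentionPool(nn.Module):
    """Pools a sequence of hidden states into one vector using weights
    computed from the states themselves (no external query)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)      # assumed W, b
        self.score = nn.Linear(hidden_dim, 1, bias=False)  # assumed w

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim) -- LSTM outputs for one modality
        e = self.score(torch.tanh(self.proj(h)))   # (batch, time, 1) scores
        alpha = torch.softmax(e, dim=1)            # attention weights over time
        return (alpha * h).sum(dim=1)              # (batch, hidden_dim)


class MultimodalKeylessFusion(nn.Module):
    """One LSTM + keyless attention pool per modality; the pooled vectors
    are concatenated and fed to a linear classifier (assumed fusion)."""

    def __init__(self, feature_dims, hidden_dim=512, num_classes=101):
        super().__init__()
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, hidden_dim, batch_first=True) for d in feature_dims])
        self.pools = nn.ModuleList(
            [KeylessAttentionPool(hidden_dim) for _ in feature_dims])
        self.classifier = nn.Linear(hidden_dim * len(feature_dims), num_classes)

    def forward(self, features):
        # features: list of tensors, one per modality, each (batch, time, dim)
        pooled = []
        for x, lstm, pool in zip(features, self.lstms, self.pools):
            h, _ = lstm(x)          # (batch, time, hidden_dim)
            pooled.append(pool(h))  # (batch, hidden_dim)
        return self.classifier(torch.cat(pooled, dim=-1))
```

For instance, UCF101 with 2048-dimensional RGB and flow features from ResNet-152 would correspond to `MultimodalKeylessFusion([2048, 2048], hidden_dim=512, num_classes=101)`.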
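
For convenience, the per-dataset settings quoted in the Experiment Setup row can be gathered into a single configuration structure. This is only a sketch: the listed values come from the paper as quoted above, while the key names are placeholders and settings not quoted here (batch size, number of epochs, YouTube-8M input features) are deliberately left unspecified.

```python
# Hedged summary of the quoted per-dataset setup. Key names are placeholders;
# None marks settings the quoted text does not specify.
EXPERIMENT_CONFIG = {
    "UCF101": {
        "features": {"rgb": "ResNet-152", "flow": "ResNet-152"},
        "finetune_segments": 3,
        "pooled_segments": 5,
        "lstm_hidden_units": 512,
    },
    "ActivityNet": {
        "features": {"rgb": "ResNet-152", "flow": "ResNet-152"},
        "finetune_segments": 7,
        "pooled_segments": 20,
        "lstm_hidden_units": 512,
    },
    "Kinetics": {
        "features": {"rgb": "Inception-ResNet-v2",
                     "flow": "Inception-ResNet-v2",
                     "audio": "VGG-16"},
        "finetune_segments": 7,
        "pooled_segments": 5,
        "lstm_hidden_units": 512,
    },
    "YouTube-8M": {
        "features": None,        # not specified in the quoted setup
        "max_segments": 300,
        "lstm_hidden_units": 1024,
    },
    "optimizer": {"algorithm": "RMSProp", "learning_rate": 1e-4},
}
```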