Attentional Pooling for Action Recognition

Authors: Rohit Girdhar, Deva Ramanan

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with three recent, large-scale action recognition datasets across still images and videos, namely MPII, HICO and HMDB51. The MPII Human Pose Dataset [34] contains 15205 images labeled with up to 16 human body keypoints and classified into one of 393 action classes. It is split into train, val (from the authors of [18]) and test sets, with 8218, 6987 and 5708 images each. We use the val set to compare with [18] and for ablative analysis, while the final test results are obtained by emailing our results to the authors of [34]. The dataset is highly imbalanced, and evaluation is performed using mean average precision (mAP) to weight all classes equally. HICO [7] is a recently introduced dataset with labels for 600 human-object interactions (HOI) combining 117 actions with 80 objects. It contains 38116 training and 9658 test images, with each image labeled with all the HOIs active for that image (multi-label setting). Like MPII, this dataset is also highly imbalanced, and evaluation is performed using mAP over classes. Finally, to verify our method's applicability to video-based action recognition, we experiment with a challenging trimmed action classification dataset, HMDB51 [27]. It contains 6766 realistic and varied video clips from 51 action classes. Evaluation is performed using average classification accuracy over three train/test splits from [23], each with 3570 train and 1530 test videos. Baselines: Throughout the following sections, we compare our approach first to the standard base architecture, mostly ResNet-101 [20], without the attention-weighted pooling. Then we compare to other reported methods and the previous state of the art on the respective datasets. MPII: We train our models for 393-way action classification on MPII with softmax cross-entropy loss for both the baseline ResNet and our attentional model. We compare our performance in Tab. 1. Our unconstrained attention model clearly outperforms the base ResNet model, as well as previous state-of-the-art methods involving detection of multiple contextual bounding boxes [18] and fusion of full-image with human bounding box features [30]. Our pose-regularized model performs best, though the improvement is small. We visualize the attention maps learned in Fig. 2. (The class-balanced mAP metric used here is sketched in code after this table.)
Researcher Affiliation | Academia | Rohit Girdhar and Deva Ramanan, The Robotics Institute, Carnegie Mellon University
Pseudocode | No | The paper describes the method using mathematical derivations and network diagrams, but does not provide structured pseudocode or algorithm blocks. (A hedged illustrative sketch of the described pooling follows after this table.)
Open Source Code | No | The paper references an existing implementation of 'Compact bilinear pooling' from a GitHub link (https://github.com/ronghanghu/tensorflow_compact_bilinear_pooling), but it does not state that the authors are releasing their own source code for the methodology described in this paper.
Open Datasets | Yes | We experiment with three recent, large-scale action recognition datasets across still images and videos, namely MPII, HICO and HMDB51. The MPII Human Pose Dataset [34] contains 15205 images labeled with up to 16 human body keypoints and classified into one of 393 action classes. HICO [7] is a recently introduced dataset with labels for 600 human-object interactions (HOI) combining 117 actions with 80 objects. Finally, to verify our method's applicability to video-based action recognition, we experiment with a challenging trimmed action classification dataset, HMDB51 [27]. It contains 6766 realistic and varied video clips from 51 action classes.
Dataset Splits | Yes | The MPII Human Pose Dataset [34] contains 15205 images labeled with up to 16 human body keypoints and classified into one of 393 action classes. It is split into train, val (from the authors of [18]) and test sets, with 8218, 6987 and 5708 images each. HICO [7] is a recently introduced dataset with labels for 600 human-object interactions (HOI) combining 117 actions with 80 objects. It contains 38116 training and 9658 test images. HMDB51 [27] ... Evaluation is performed using average classification accuracy over three train/test splits from [23], each with 3570 train and 1530 test videos.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper does not provide specific version numbers for the software dependencies or libraries used in the experiments (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | No | The paper mentions general training settings such as 'softmax cross-entropy loss' and 're-sizing input frames to 450px' for certain datasets, but it does not provide specific experimental setup details such as learning rates, batch sizes, optimizer types, or training epochs. (The stated pieces of the setup are collected in a sketch after this table.)
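
Both MPII and HICO (see the Research Type row above) are scored with mAP averaged over classes, so that rare classes weigh as much as frequent ones. A minimal sketch of that metric follows; it is illustrative only, not the benchmarks' official evaluation code, and the function name and array shapes are assumptions:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def class_mean_ap(y_true, y_score):
    """Class-mean average precision (sketch).

    y_true:  (num_images, num_classes) binary label matrix
    y_score: (num_images, num_classes) real-valued class scores
    """
    aps = []
    for k in range(y_true.shape[1]):
        if y_true[:, k].sum() == 0:  # AP is undefined for classes with no positives
            continue
        aps.append(average_precision_score(y_true[:, k], y_score[:, k]))
    # Averaging per class (rather than per image) is what equalizes the
    # weight of rare and frequent classes on these imbalanced benchmarks.
    return float(np.mean(aps))
```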
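Since the paper presents its method through derivations rather than pseudocode (Pseudocode row above), here is one plausible PyTorch sketch of the attention-weighted pooling it describes: a rank-1 factorization in which a class-agnostic spatial attention map weights the features before a class-specific linear classifier. The module and layer names are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class AttentionalPooling(nn.Module):
    """Rank-1 attentional pooling over a CNN feature map (sketch)."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.attention = nn.Linear(feat_dim, 1, bias=False)             # class-agnostic attention
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)  # class-specific weights

    def forward(self, x):
        # x: (batch, feat_dim, H, W), e.g. the last feature map of ResNet-101
        x = x.flatten(2).transpose(1, 2)   # (batch, H*W, feat_dim)
        attn = self.attention(x)           # (batch, H*W, 1) spatial attention scores
        pooled = (attn * x).sum(dim=1)     # attention-weighted spatial pooling
        return self.classifier(pooled)     # (batch, num_classes) logits
```

Replacing this module with a plain spatial average of `x` recovers the base-architecture baseline the paper compares against.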
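For concreteness, the two setup details the paper does state (softmax cross-entropy and 450px inputs) can be written down as below; everything else in this sketch is a placeholder rather than a reproduction recipe, since the paper does not specify it:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Base architecture named in the paper; the 393-way head matches MPII.
model = models.resnet101(num_classes=393)

preprocess = transforms.Compose([
    transforms.Resize(450),   # stated: input frames re-sized to 450px
    transforms.ToTensor(),
])

criterion = nn.CrossEntropyLoss()  # stated: softmax cross-entropy loss

# NOT stated in the paper: optimizer, learning rate, momentum, batch size,
# schedule, or number of epochs. The values below are arbitrary placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```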