Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Authors: Sijie Yan, Yuanjun Xiong, Dahua Lin

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we evaluate the performance of ST-GCN in skeleton-based action recognition experiments. We experiment on two large-scale action recognition datasets with vastly different properties: the Kinetics human action dataset (Kay et al. 2017), by far the largest unconstrained action recognition dataset, and NTU-RGB+D (Shahroudy et al. 2016), the largest in-house captured action recognition dataset. In particular, we first perform a detailed ablation study on the Kinetics dataset to examine the contributions of the proposed model components to the recognition performance. Then we compare the recognition results of ST-GCN with other state-of-the-art methods and other input modalities.
Researcher Affiliation | Academia | Sijie Yan, Yuanjun Xiong, Dahua Lin; Department of Information Engineering, The Chinese University of Hong Kong; {ys016, dhlin}@ie.cuhk.edu.hk, bitxiong@gmail.com
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper.
Open Source Code | Yes | The code and models of ST-GCN are made publicly available at https://github.com/yysijie/st-gcn.
Open Datasets | Yes | The DeepMind Kinetics human action dataset (Kay et al. 2017) ... NTU-RGB+D (Shahroudy et al. 2016) is currently the largest dataset with 3D joint annotations for the human action recognition task.
Dataset Splits | Yes | The Kinetics dataset provides a training set of 240,000 clips and a validation set of 20,000. ... NTU-RGB+D: ... 1) cross-subject (X-Sub) benchmark with 39,889 and 16,390 clips for training and evaluation; ... 2) cross-view (X-View) benchmark with 37,462 and 18,817 clips.
Hardware Specification | Yes | All experiments were conducted on the PyTorch deep learning framework with 8 TITAN X GPUs.
Software Dependencies | No | The paper mentions the "Pytorch deep learning framework" but does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | The ST-GCN model is composed of 9 layers of spatial temporal graph convolution operators (ST-GCN units). The first three layers have 64 output channels, the next three have 128, and the last three have 256. All of these layers use a temporal kernel size of 9. The ResNet-style residual mechanism is applied to each ST-GCN unit, and we randomly drop out features with 0.5 probability after each ST-GCN unit to avoid overfitting. The strides of the 4th and 7th temporal convolution layers are set to 2, serving as pooling layers. After that, global pooling is performed on the resulting tensor to obtain a 256-dimensional feature vector for each sequence. Finally, we feed it to a SoftMax classifier. The models are learned using stochastic gradient descent with a learning rate of 0.01, and we decay the learning rate by 0.1 after every 10 epochs. To avoid overfitting, we perform two kinds of augmentation when training on the Kinetics dataset (Kay et al. 2017). First, to simulate camera movement, we perform random affine transformations on the skeleton sequences of all frames. Specifically, from the first frame to the last frame, we select a few fixed angle, translation, and scaling factors as candidates, and then randomly sample two combinations of the three factors to generate an affine transformation. This transformation is interpolated for intermediate frames to produce an effect as if the viewpoint were moving smoothly during playback. We call this augmentation random moving. Second, we randomly sample fragments from the original skeleton sequences during training and use all frames at test time. Global pooling at the top of the network enables the network to handle input sequences of indefinite length.
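
To make the experiment-setup row concrete, the layer configuration and training schedule it describes can be sketched in PyTorch, the framework the paper reports using. This is a minimal sketch under stated assumptions, not the authors' released implementation: the STGCNBlock below only reproduces the channel/stride/dropout/residual structure with a plain temporal convolution and omits the actual spatial graph convolution over the skeleton joints; the input-channel count (3) and class count (400) are assumptions for the Kinetics setting.

import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Placeholder for one ST-GCN unit (sketch only).

    The real unit applies a graph convolution over the joints followed by a
    temporal convolution; here only the temporal convolution, residual
    connection, and dropout described in the paper are reproduced.
    """
    def __init__(self, in_channels, out_channels, temporal_kernel=9, stride=1, dropout=0.5):
        super().__init__()
        pad = (temporal_kernel - 1) // 2
        # Temporal convolution over the frame axis; input is (N, C, T, V).
        self.tcn = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=(temporal_kernel, 1),
                      stride=(stride, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),  # features dropped with 0.5 probability after each unit
        )
        # Residual ("ResNet mechanism") branch, projected when the shape changes.
        if in_channels == out_channels and stride == 1:
            self.residual = nn.Identity()
        else:
            self.residual = nn.Conv2d(in_channels, out_channels,
                                      kernel_size=1, stride=(stride, 1))

    def forward(self, x):
        return self.tcn(x) + self.residual(x)

class STGCN(nn.Module):
    def __init__(self, in_channels=3, num_classes=400):
        super().__init__()
        # 9 units: 3 x 64, 3 x 128, 3 x 256 output channels; the 4th and 7th
        # units use temporal stride 2 in place of pooling layers.
        cfg = [(in_channels, 64, 1), (64, 64, 1), (64, 64, 1),
               (64, 128, 2), (128, 128, 1), (128, 128, 1),
               (128, 256, 2), (256, 256, 1), (256, 256, 1)]
        self.blocks = nn.Sequential(*[STGCNBlock(c_in, c_out, stride=s)
                                      for c_in, c_out, s in cfg])
        self.fc = nn.Linear(256, num_classes)  # SoftMax classifier head

    def forward(self, x):              # x: (N, C, T, V) skeleton sequences
        x = self.blocks(x)
        x = x.mean(dim=(2, 3))         # global pooling -> 256-d vector per sequence
        return self.fc(x)

model = STGCN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Decay the learning rate by 0.1 every 10 epochs, matching the reported schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

The global mean over the time and joint axes is what lets the same network accept sequences of indefinite length at test time.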
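
The two Kinetics augmentations can be sketched the same way. The candidate angle, scale, and translation values below are assumptions (the paper does not list them in this section), and the function names are hypothetical; the sketch only illustrates the described mechanism of drawing two endpoint transforms, interpolating them across frames ("random moving"), and randomly cropping a temporal fragment during training.

import numpy as np

def random_moving(skeleton, angle_candidates=(-10.0, 0.0, 10.0),
                  scale_candidates=(0.9, 1.0, 1.1),
                  trans_candidates=(-0.2, 0.0, 0.2)):
    """Sketch of 'random moving': skeleton has shape (C, T, V) with the first
    two channels holding (x, y) joint coordinates. Two (angle, scale,
    translation) combinations are drawn from fixed candidate sets, assigned to
    the first and last frame, and linearly interpolated in between to mimic a
    smoothly moving camera."""
    C, T, V = skeleton.shape
    a0, a1 = np.deg2rad(np.random.choice(angle_candidates, 2))
    s0, s1 = np.random.choice(scale_candidates, 2)
    tx0, tx1 = np.random.choice(trans_candidates, 2)
    ty0, ty1 = np.random.choice(trans_candidates, 2)

    out = skeleton.copy()
    for t in range(T):
        w = t / max(T - 1, 1)                 # interpolation weight in [0, 1]
        a = (1 - w) * a0 + w * a1
        s = (1 - w) * s0 + w * s1
        tx = (1 - w) * tx0 + w * tx1
        ty = (1 - w) * ty0 + w * ty1
        rot = np.array([[np.cos(a), -np.sin(a)],
                        [np.sin(a),  np.cos(a)]])
        xy = skeleton[:2, t, :]               # (2, V) joint coordinates
        out[:2, t, :] = s * (rot @ xy) + np.array([[tx], [ty]])
    return out

def random_temporal_crop(skeleton, fragment_length):
    """Randomly sample a training fragment of `fragment_length` frames
    (a hypothetical parameter); at test time the full sequence is used,
    since global pooling handles variable length."""
    C, T, V = skeleton.shape
    if T <= fragment_length:
        return skeleton
    start = np.random.randint(0, T - fragment_length + 1)
    return skeleton[:, start:start + fragment_length, :]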