Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Authors: Sijie Yan, Yuanjun Xiong, Dahua Lin

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we evaluate the performance of ST-GCN in skeleton-based action recognition experiments. We experiment on two large-scale action recognition datasets with vastly different properties: the Kinetics human action dataset (Kay et al. 2017), by far the largest unconstrained action recognition dataset, and NTU-RGB+D (Shahroudy et al. 2016), the largest in-house captured action recognition dataset. In particular, we first perform a detailed ablation study on the Kinetics dataset to examine the contributions of the proposed model components to the recognition performance. Then we compare the recognition results of ST-GCN with other state-of-the-art methods and other input modalities.
Researcher Affiliation | Academia | Sijie Yan, Yuanjun Xiong, Dahua Lin; Department of Information Engineering, The Chinese University of Hong Kong; {ys016, dhlin}@ie.cuhk.edu.hk, bitxiong@gmail.com
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper.
Open Source Code | Yes | The code and models of ST-GCN are made publicly available at https://github.com/yysijie/st-gcn.
Open Datasets | Yes | The DeepMind Kinetics human action dataset (Kay et al. 2017) ... NTU-RGB+D (Shahroudy et al. 2016) is currently the largest dataset with 3D joint annotations for the human action recognition task.
Dataset Splits | Yes | The Kinetics dataset provides a training set of 240,000 clips and a validation set of 20,000. ... NTU-RGB+D: ... 1) cross-subject (X-Sub) benchmark with 39,889 and 16,390 clips for training and evaluation; ... 2) cross-view (X-View) benchmark with 37,462 and 18,817 clips.
Hardware Specification | Yes | All experiments were conducted on the PyTorch deep learning framework with 8 TITAN X GPUs.
Software Dependencies | No | The paper mentions the "Pytorch deep learning framework" but does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | The ST-GCN model is composed of 9 layers of spatial temporal graph convolution operators (ST-GCN units). The first three layers have 64 output channels, the next three have 128, and the last three have 256. All of these layers use a temporal kernel size of 9. The ResNet-style residual mechanism is applied to each ST-GCN unit, and we randomly drop out features with 0.5 probability after each ST-GCN unit to avoid overfitting. The strides of the 4th and 7th temporal convolution layers are set to 2, serving as pooling layers. After that, global pooling is performed on the resulting tensor to obtain a 256-dimensional feature vector for each sequence. Finally, we feed it to a SoftMax classifier. The models are learned using stochastic gradient descent with a learning rate of 0.01, and we decay the learning rate by 0.1 after every 10 epochs. To avoid overfitting, we perform two kinds of augmentation when training on the Kinetics dataset (Kay et al. 2017). First, to simulate camera movement, we perform random affine transformations on the skeleton sequences of all frames. Specifically, from the first frame to the last frame, we select a few fixed angle, translation, and scaling factors as candidates, and then randomly sample two combinations of the three factors to generate an affine transformation. This transformation is interpolated for intermediate frames to produce an effect as if the viewpoint were moving smoothly during playback. We call this augmentation random moving. Second, we randomly sample fragments from the original skeleton sequences during training and use all frames at test time. Global pooling at the top of the network enables the network to handle input sequences of indefinite length.
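
To make the experiment-setup row concrete, the layer configuration and training schedule it describes can be sketched in PyTorch, the framework the paper reports using. This is a minimal sketch under stated assumptions, not the authors' released implementation: the STGCNBlock below only reproduces the channel/stride/dropout/residual structure with a plain temporal convolution and omits the actual spatial graph convolution over the skeleton joints; the input-channel count (3) and class count (400) are assumptions for the Kinetics setting.

import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Placeholder for one ST-GCN unit (sketch only).

    The real unit applies a graph convolution over the joints followed by a
    temporal convolution; here only the temporal convolution, residual
    connection, and dropout described in the paper are reproduced.
    """
    def __init__(self, in_channels, out_channels, temporal_kernel=9, stride=1, dropout=0.5):
        super().__init__()
        pad = (temporal_kernel - 1) // 2
        # Temporal convolution over the frame axis; input is (N, C, T, V).
        self.tcn = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=(temporal_kernel, 1),
                      stride=(stride, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),  # features dropped with 0.5 probability after each unit
        )
        # Residual ("ResNet mechanism") branch, projected when the shape changes.
        if in_channels == out_channels and stride == 1:
            self.residual = nn.Identity()
        else:
            self.residual = nn.Conv2d(in_channels, out_channels,
                                      kernel_size=1, stride=(stride, 1))

    def forward(self, x):
        return self.tcn(x) + self.residual(x)

class STGCN(nn.Module):
    def __init__(self, in_channels=3, num_classes=400):
        super().__init__()
        # 9 units: 3 x 64, 3 x 128, 3 x 256 output channels; the 4th and 7th
        # units use temporal stride 2 in place of pooling layers.
        cfg = [(in_channels, 64, 1), (64, 64, 1), (64, 64, 1),
               (64, 128, 2), (128, 128, 1), (128, 128, 1),
               (128, 256, 2), (256, 256, 1), (256, 256, 1)]
        self.blocks = nn.Sequential(*[STGCNBlock(c_in, c_out, stride=s)
                                      for c_in, c_out, s in cfg])
        self.fc = nn.Linear(256, num_classes)  # SoftMax classifier head

    def forward(self, x):              # x: (N, C, T, V) skeleton sequences
        x = self.blocks(x)
        x = x.mean(dim=(2, 3))         # global pooling -> 256-d vector per sequence
        return self.fc(x)

model = STGCN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Decay the learning rate by 0.1 every 10 epochs, matching the reported schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

The global mean over the time and joint axes is what lets the same network accept sequences of indefinite length at test time.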
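
The two Kinetics augmentations can be sketched the same way. The candidate angle, scale, and translation values below are assumptions (the paper does not list them in this section), and the function names are hypothetical; the sketch only illustrates the described mechanism of drawing two endpoint transforms, interpolating them across frames ("random moving"), and randomly cropping a temporal fragment during training.

import numpy as np

def random_moving(skeleton, angle_candidates=(-10.0, 0.0, 10.0),
                  scale_candidates=(0.9, 1.0, 1.1),
                  trans_candidates=(-0.2, 0.0, 0.2)):
    """Sketch of 'random moving': skeleton has shape (C, T, V) with the first
    two channels holding (x, y) joint coordinates. Two (angle, scale,
    translation) combinations are drawn from fixed candidate sets, assigned to
    the first and last frame, and linearly interpolated in between to mimic a
    smoothly moving camera."""
    C, T, V = skeleton.shape
    a0, a1 = np.deg2rad(np.random.choice(angle_candidates, 2))
    s0, s1 = np.random.choice(scale_candidates, 2)
    tx0, tx1 = np.random.choice(trans_candidates, 2)
    ty0, ty1 = np.random.choice(trans_candidates, 2)

    out = skeleton.copy()
    for t in range(T):
        w = t / max(T - 1, 1)                 # interpolation weight in [0, 1]
        a = (1 - w) * a0 + w * a1
        s = (1 - w) * s0 + w * s1
        tx = (1 - w) * tx0 + w * tx1
        ty = (1 - w) * ty0 + w * ty1
        rot = np.array([[np.cos(a), -np.sin(a)],
                        [np.sin(a),  np.cos(a)]])
        xy = skeleton[:2, t, :]               # (2, V) joint coordinates
        out[:2, t, :] = s * (rot @ xy) + np.array([[tx], [ty]])
    return out

def random_temporal_crop(skeleton, fragment_length):
    """Randomly sample a training fragment of `fragment_length` frames
    (a hypothetical parameter); at test time the full sequence is used,
    since global pooling handles variable length."""
    C, T, V = skeleton.shape
    if T <= fragment_length:
        return skeleton
    start = np.random.randint(0, T - fragment_length + 1)
    return skeleton[:, start:start + fragment_length, :]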