Attention in Convolutional LSTM for Gesture Recognition

Authors: Liang Zhang, Guangming Zhu, Lin Mei, Peiyi Shen, Syed Afaq Ali Shah, Mohammed Bennamoun

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The evaluation results demonstrate that the spatial convolutions in the three gates scarcely contribute to the spatiotemporal feature fusion, and the attention mechanisms embedded into the input and output gates cannot improve the feature fusion.
Researcher Affiliation | Academia | Liang Zhang (Xidian University, liangzhang@xidian.edu.cn); Guangming Zhu (Xidian University, gmzhu@xidian.edu.cn); Lin Mei (Xidian University, l_mei72@hotmail.com); Peiyi Shen (Xidian University, pyshen@xidian.edu.cn); Syed Afaq Ali Shah (Central Queensland University, afaq.shah@uwa.edu.au); Mohammed Bennamoun (University of Western Australia, mohammed.bennamoun@uwa.edu.au)
Pseudocode | No | The paper describes its formulations using mathematical equations and figures, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code of the LSTM variants is publicly available" (footnote 2: https://github.com/GuangmingZhu/AttentionConvLSTM).
Open Datasets | Yes | The proposed variants of ConvLSTM are evaluated on the large-scale isolated gesture datasets Jester [18] and IsoGD [19] in this paper. Jester [18] is a large collection of densely-labeled video clips (https://www.twentybn.com/datasets/jester, 2017). IsoGD [19] is a large-scale isolated gesture dataset which contains 47,933 RGB+D gesture videos of 249 kinds of gestures performed by 21 subjects. The dataset has been used in the 2016 [24] and 2017 [25] ChaLearn LAP Large-scale Isolated Gesture Recognition Challenges.
Dataset Splits | No | "The evaluation on Jester has almost the same accuracy except for variant (b). The similar recognition results on Jester may be caused by the network capacity or the distinguishability of the data, because the validation has a comparable accuracy with the training." While validation is mentioned, the paper does not specify explicit training/validation/test split percentages or sample counts for dataset partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions components like Res3D and MobileNet, but does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | For the training on Jester, the learning rate follows a polynomial decay from 0.001 to 0.000001 over a total of 30 epochs. The input is a batch of 16 video clips, and each clip contains 16 frames with a spatial size of 112×112. During the fine-tuning on IsoGD, the batch size is set to 8, the temporal length is set to 32, and a total of 15 epochs are performed for each variant. Top-1 accuracy is used as the evaluation metric. Stochastic gradient descent (SGD) is used for training. The filter numbers of ConvLSTM and the variants are all set to 256.
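For context on the claim that "the spatial convolutions in the three gates scarcely contribute" (Research Type row above), the paper's variants modify the standard ConvLSTM cell of Shi et al. (2015), whose gates are computed with convolutions rather than matrix products (peephole terms omitted here for brevity; `*` denotes convolution and `∘` the Hadamard product):

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\!\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
H_t &= o_t \circ \tanh\!\left(C_t\right)
\end{aligned}
```

The variants studied in the paper replace or augment the convolutions in the input, forget, and output gates ($i_t$, $f_t$, $o_t$), which is why the finding that those spatial convolutions contribute little is the headline result.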
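The learning-rate schedule quoted above (polynomial decay from 0.001 to 0.000001 over 30 epochs) can be sketched as follows. The decay power is an assumption on our part, since the paper does not state it; `power = 1.0` reduces to plain linear decay.

```python
def poly_decay_lr(epoch, total_epochs=30, lr_start=1e-3, lr_end=1e-6, power=1.0):
    """Polynomial learning-rate decay from lr_start down to lr_end.

    Note: the paper only specifies the endpoints and the epoch count;
    the decay power here is an assumption (1.0 = linear decay).
    """
    frac = min(epoch, total_epochs) / total_epochs
    return (lr_start - lr_end) * (1.0 - frac) ** power + lr_end

# Build the per-epoch schedule: starts at 1e-3, ends at 1e-6.
schedule = [poly_decay_lr(e) for e in range(31)]
```

Frameworks ship equivalent ready-made schedulers (e.g. TensorFlow's `PolynomialDecay` or PyTorch's `PolynomialLR`), so a sketch like this is mainly useful for checking what the quoted hyperparameters imply per epoch.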