Multi-Scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

Authors: Haoze Wu, Jiawei Liu, Xierong Zhu, Meng Wang, Zheng-Jun Zha

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that our MSTI-Net significantly boosts the performance of existing convolution networks and achieves state-of-the-art accuracy on three challenging benchmarks, i.e., UCF-101, HMDB-51 and Kinetics-400, with much fewer parameters and FLOPs.
Researcher Affiliation | Academia | Haoze Wu¹, Jiawei Liu¹, Xierong Zhu¹, Meng Wang² and Zheng-Jun Zha¹ (¹University of Science and Technology of China, ²Hefei University of Technology)
Pseudocode | No | The paper provides architectural diagrams and mathematical formulas but no pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We use three widely-used and challenging benchmarks, i.e. Kinetics-400 [Kay et al., 2017], UCF-101 [Soomro et al., 2012], and HMDB-51 [Kuehne et al., 2013] in the experiments.
Dataset Splits | No | Both UCF-101 and HMDB-51 consist of three official training/test splits provided by the dataset organizers (loadable as shown in the dataset sketch after the table). The paper explicitly mentions 'training/test splits' but does not specify a validation split or its size/methodology.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used to run the experiments.
Software Dependencies | No | The paper mentions the Adam optimizer but does not name any software libraries, frameworks, or other dependencies with version numbers.
Experiment Setup | Yes | Our data augmentation includes random clipping in both the spatial dimension (by first resizing the smaller video side to 256 pixels, then randomly cropping a 224×224 patch) and the temporal dimension (by randomly picking the starting frame among those early enough to guarantee the desired number of frames). We use the Adam optimizer with an initial learning rate of 1e-4 to train the MSTI-related networks from scratch. The dropout ratio is set to 0.5 and the weight decay rate to 5e-5. A gradient descent optimizer with an initial learning rate of 1e-5 and a momentum of 0.9 is adopted to train our MSTI-Net initialized with the Kinetics-400 and ImageNet-1k pre-trained model. To prevent over-fitting, we further employ a higher dropout ratio of 0.9 and a weight decay rate of 5e-4.
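The quoted setup translates directly into standard deep-learning tooling. Below is a minimal PyTorch sketch of the two training configurations; since the paper releases no code (see the Open Source Code row), the placeholder `model`, the `temporal_clip` helper, and the (T, C, H, W) video layout are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Placeholder network standing in for MSTI-Net (whose code is not released);
# any 3D CNN ending in a 400-way classifier for Kinetics-400 fits here.
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),  # 0.5 from scratch; the paper raises this to 0.9 for fine-tuning
    nn.Linear(8, 400),
)

# Spatial augmentation as quoted: resize the shorter video side to 256 pixels,
# then take one random 224x224 crop (applied identically to every frame,
# since RandomCrop draws its parameters once per call).
spatial_aug = T.Compose([T.Resize(256), T.RandomCrop(224)])

def temporal_clip(video: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Randomly pick a start frame early enough to yield `num_frames`
    consecutive frames; `video` is assumed to be (T, C, H, W)."""
    max_start = video.shape[0] - num_frames
    start = torch.randint(0, max_start + 1, (1,)).item()
    return video[start:start + num_frames]

# Stage 1: training from scratch -- Adam, lr 1e-4, weight decay 5e-5.
optimizer_scratch = torch.optim.Adam(
    model.parameters(), lr=1e-4, weight_decay=5e-5)

# Stage 2: fine-tuning from Kinetics-400/ImageNet-1k pre-trained weights --
# SGD with momentum 0.9, lr 1e-5, weight decay 5e-4.
optimizer_finetune = torch.optim.SGD(
    model.parameters(), lr=1e-5, momentum=0.9, weight_decay=5e-4)
```

The two optimizer configurations mirror the paper's two regimes: a larger learning rate with light regularization when training from scratch, versus a ten-times-smaller learning rate with heavy dropout and weight decay when starting from pre-trained weights.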
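As a practical aside on the Open Datasets and Dataset Splits rows, all three benchmarks are exposed through torchvision's dataset classes, and the three official UCF-101/HMDB-51 splits map onto their `fold` argument. The sketch below assumes the videos and official annotation files have already been downloaded to the indicated (hypothetical) paths.

```python
from torchvision.datasets import UCF101, HMDB51, Kinetics

# `fold` (1-3) selects one of the three official training/test splits
# that the review above notes the paper relies on.
ucf_train = UCF101(
    root="data/ucf101/videos",             # hypothetical path
    annotation_path="data/ucf101/splits",  # official split files
    frames_per_clip=16,
    fold=1,
    train=True,
)

hmdb_train = HMDB51(
    root="data/hmdb51/videos",
    annotation_path="data/hmdb51/splits",
    frames_per_clip=16,
    fold=1,
    train=True,
)

# Kinetics-400 ships only train/val splits; there is no fold argument.
kinetics_train = Kinetics(
    root="data/kinetics400",
    frames_per_clip=16,
    num_classes="400",
    split="train",
)
```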