StNet: Local and Global Spatial-Temporal Modeling for Action Recognition
Authors: Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, Shilei Wen
AAAI 2019, pp. 8401-8408
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. |
| Researcher Affiliation | Collaboration | Department of Computer Vision Technology (VIS), Baidu Inc.; MIT-IBM Watson AI Lab; University of Central Florida; State Key Lab for Novel Software Technology, Nanjing University, China |
| Pseudocode | No | The paper describes the architecture and components with diagrams and text, but it does not include a specific pseudocode block or algorithm listing. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | To evaluate the performance of our proposed StNet framework for large scale video-based action recognition, we perform extensive experiments on the recent large scale action recognition dataset named Kinetics (Kay et al. 2017). ...To validate that the effectiveness of StNet could be transferred to other datasets, we conduct transfer learning experiments on the UCF101 (Soomro, Zamir, and Shah 2012)... |
| Dataset Splits | Yes | The validation set of Kinetics400 consists of about 20K video clips. The second version of Kinetics (denoted as Kinetics600) contains 600 action categories and there are about 400K trimmed video clips in its training set and 30K clips in the validation set. ... The labeled video clips are divided into three training/testing splits for evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types) for running its experiments. It mentions using '2D Conv Net frameworks' but no hardware specifics. |
| Software Dependencies | No | The paper mentions software components like "Conv3d-BN3d-ReLU" and "SGD" but does not provide specific version numbers for any software, libraries, or frameworks used (e.g., TensorFlow, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In our current setting, N is set to 5. ... In this experiment, T is set to 7 and N to 5. The video frames are scaled such that their short side is 331, and a random 299×299 patch (training phase) or the central 299×299 patch (testing phase) is cropped from each of the T frames. ... In the temporal modeling blocks, weights of Conv3d layers are initially set to 1/(3 × Ci), where Ci denotes the input channel size, and biases are set to 0. Hedged sketches of this preprocessing and initialization follow the table. |
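
The frame preprocessing quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, assuming a PyTorch/torchvision pipeline (the paper does not name its framework); only the short-side-331 scaling and the 299×299 random/central crops come from the paper, while `Resize`, `RandomCrop`, and `CenterCrop` are standard torchvision transforms used here for illustration.

```python
# Hedged sketch of the paper's frame preprocessing.
# The framework choice is an assumption; the paper only specifies
# short side scaled to 331 and 299x299 crops for train/test.
import torchvision.transforms as T

# Training: scale short side to 331, then a random 299x299 crop per frame.
train_transform = T.Compose([
    T.Resize(331),        # short side -> 331, aspect ratio preserved
    T.RandomCrop(299),    # random 299x299 patch
    T.ToTensor(),
])

# Testing: same scaling, but the central 299x299 crop.
test_transform = T.Compose([
    T.Resize(331),
    T.CenterCrop(299),
    T.ToTensor(),
])
```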
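
Similarly, a minimal sketch of a Conv3d-BN3d-ReLU temporal modeling block with the initialization quoted above: Conv3d weights set to 1/(3 × Ci) and biases to 0. The temporal kernel size of 3 and the channel-preserving 3×1×1 shape are assumptions consistent with the 1/(3 × Ci) factor, not confirmed hyperparameters from the paper.

```python
# Hedged sketch (not the authors' code) of a Conv3d-BN3d-ReLU temporal block.
# Kernel size (3, 1, 1) is an assumption inferred from the 1/(3 * C_i) init.
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Temporal-only convolution; spatial resolution is left intact.
        self.conv = nn.Conv3d(channels, channels,
                              kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.bn = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Initialization as stated in the paper: weights 1/(3 * C_i), biases 0.
        nn.init.constant_(self.conv.weight, 1.0 / (3 * channels))
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T, H, W)
        return self.relu(self.bn(self.conv(x)))
```

With this initialization, the block starts out close to a temporal averaging operator, which is a common way to keep early training stable when inserting 3D layers into a 2D backbone.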