Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition

Authors: Jiazheng Xing, Mengmeng Wang, Yong Liu, Boyu Mu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods on all datasets. The extensive experiments on four widely-used datasets (Something-Something V2 (SSV2) (Goyal et al. 2017), Kinetics (Carreira and Zisserman 2017), UCF101 (Soomro, Zamir, and Shah 2012), and HMDB51 (Kuehne et al. 2011)) demonstrate the effectiveness of our methods.
Researcher Affiliation | Academia | Jiazheng Xing, Mengmeng Wang*, Yong Liu, Boyu Mu; Zhejiang University, Hangzhou, China; {jiazhengxing, mengmengwang, muboyu}@zju.edu.cn, yongliu@iipc.zju.edu.cn
Pseudocode | No | The paper describes methodologies using text, equations, and diagrams, but does not include structured pseudocode blocks or algorithms labeled as such.
Open Source Code | No | The paper does not include any explicit statement about releasing source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. The extensive experiments on four widely-used datasets (Something-Something V2 (SSV2) (Goyal et al. 2017), Kinetics (Carreira and Zisserman 2017), UCF101 (Soomro, Zamir, and Shah 2012), and HMDB51 (Kuehne et al. 2011)) demonstrate the effectiveness of our methods.
Dataset Splits | No | The paper describes sampling training episodes and testing on tasks drawn from the test sets, but it does not specify a distinct validation split for model tuning or early stopping, as would be expected in a standard train/validation/test protocol.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using ResNet-50 as a feature extractor and the SGD optimizer, but it does not specify software versions for libraries, frameworks, or programming languages (e.g., Python version, PyTorch version).
Experiment Setup | Yes | Network Architectures: We use ResNet-50 as the feature extractor with ImageNet pre-trained weights (Deng et al. 2009). In FFAS, we automatically search for the best combination of the four layers in ResNet-50, and the weights of the three optional operations are initialized equally in each layer. We use a 3×3 convolution layer as Module_align, spatial self-attention as f_fuse in FFAS, and a two-layer multi-head attention as Module_att in LTMM. r in STMM is set to 16. The initial weights of the learnable parameter A and λ are set to [0.1, 0.1, 0.1, 0.1] and 0.5, respectively. In the frame-level class prototype matcher, we set D = 1152, Ω = {1} for spatial-related datasets, and Ω = {1, 2} for temporal-related datasets. Training and Inference: We uniformly sample 8 frames (l = 8) of a video as the input, augmented with random horizontal flipping and 224×224 crops in training, and use only a center crop at inference. For training, SSV2 uses 100,000 randomly sampled training episodes with an initial learning rate of 10^-4, and the other datasets use 10,000 randomly sampled training episodes with an initial learning rate of 10^-3. Moreover, we use the SGD optimizer with a multi-step scheduler for our framework.
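
To make the quoted training recipe concrete, the sketch below collects the reported hyperparameters (per-dataset episode counts and initial learning rates, 8-frame sampling, 224×224 crops, SGD with a multi-step scheduler) into a single configuration. This is a minimal sketch, assuming PyTorch/torchvision as the framework; the momentum value, scheduler milestones, and decay factor are placeholders not reported in the paper, and `model` stands in for the authors' SloshNet, whose code is not released.

```python
# Hedged sketch of the reported training setup. Framework (PyTorch), momentum,
# and scheduler milestones/gamma are assumptions; only episode counts, learning
# rates, frame count, crop size, and the SGD + multi-step choice come from the paper.
import torch
from torch import optim
from torchvision import transforms

NUM_FRAMES = 8  # l = 8 uniformly sampled frames per video

# Reported training episodes and initial learning rates per dataset
TRAIN_CONFIG = {
    "ssv2":     {"episodes": 100_000, "lr": 1e-4},
    "kinetics": {"episodes": 10_000,  "lr": 1e-3},
    "ucf101":   {"episodes": 10_000,  "lr": 1e-3},
    "hmdb51":   {"episodes": 10_000,  "lr": 1e-3},
}

# Frame-level augmentation: random horizontal flip + 224x224 crop for training,
# a single 224x224 center crop at inference (exact resize policy not specified).
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def build_optimizer(model: torch.nn.Module, dataset: str):
    """SGD with a multi-step LR scheduler, as stated in the paper."""
    cfg = TRAIN_CONFIG[dataset]
    optimizer = optim.SGD(model.parameters(), lr=cfg["lr"], momentum=0.9)  # momentum assumed
    scheduler = optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[cfg["episodes"] // 2, 3 * cfg["episodes"] // 4],  # placeholder milestones
        gamma=0.1,  # placeholder decay factor
    )
    return optimizer, scheduler
```

The per-dataset split mirrors the reported protocol: SSV2, being temporal-related, gets an order of magnitude more episodes at a lower initial learning rate than Kinetics, UCF101, and HMDB51.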