Task-Agnostic Self-Distillation for Few-Shot Action Recognition

Authors: Bin Zhang, Yuanjie Dang, Peng Chen, Ronghua Liang, Nan Gao, Ruohong Huan, Xiaofei He

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on standard datasets demonstrate our method's superior performance compared to existing state-of-the-art methods.
Researcher Affiliation | Academia | Bin Zhang¹, Yuanjie Dang¹, Peng Chen¹, Ronghua Liang¹, Nan Gao¹, Ruohong Huan¹ and Xiaofei He² (¹Zhejiang University of Technology, ²Zhejiang University)
Pseudocode | No | The paper describes its methods verbally and mathematically but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology.
Open Datasets | Yes | We evaluate our approach on three standard datasets, including Kinetics [Carreira and Zisserman, 2017], UCF101 [Soomro et al., 2012], and HMDB51 [Kuehne et al., 2011]. The datasets are partitioned into meta-training, meta-validation, and meta-testing sets based on action categories to meet the requirements of the few-shot classification setting.
Dataset Splits | Yes | For Kinetics, we follow the splitting strategy proposed by [Zhu and Yang, 2018], selecting 100 action categories, each with 100 samples, and dividing these categories into 64, 12, and 24 for training, validation, and testing, respectively. For UCF101, we split it into 70, 10, and 21 categories for training, validation, and testing. In the case of HMDB51, we split it into 31, 10, and 10 categories for training, validation, and testing, adhering to the same splitting strategy as in [Zhang et al., 2020].
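For concreteness, the quoted category splits can be summarized as a small configuration table. The sketch below is ours, not the authors': the paper gives only the per-split category counts (not the class lists), and names such as `FEW_SHOT_SPLITS` are hypothetical. The totals match the standard class counts of each dataset.

```python
# Meta-learning category splits quoted above. Only the counts come from
# the paper; the dictionary layout and all names are illustrative.
FEW_SHOT_SPLITS = {
    "Kinetics": {"train": 64, "val": 12, "test": 24},  # 100 classes, 100 samples each [Zhu and Yang, 2018]
    "UCF101":   {"train": 70, "val": 10, "test": 21},  # 101 classes in total
    "HMDB51":   {"train": 31, "val": 10, "test": 10},  # 51 classes, as in [Zhang et al., 2020]
}

def check_split(dataset: str, total_classes: int) -> None:
    """Sanity-check that a split's category counts sum to the dataset total."""
    assert sum(FEW_SHOT_SPLITS[dataset].values()) == total_classes

check_split("Kinetics", 100)  # 64 + 12 + 24
check_split("UCF101", 101)    # 70 + 10 + 21
check_split("HMDB51", 51)     # 31 + 10 + 10
```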
Hardware Specification | Yes | We implement our framework using PyTorch and conduct training on one RTX 4090 GPU.
Software Dependencies | No | The paper mentions implementing the framework using 'PyTorch' but does not specify a version number or other software dependencies with their versions.
Experiment Setup | Yes | Following the common paradigm of existing few-shot action recognition methods [Cao et al., 2020; Wang et al., 2023b], we employ ResNet50 [He et al., 2016] as the backbone network and initialize it with weights pre-trained on ImageNet [Deng et al., 2009] to extract frame-level features. We sparsely and uniformly sample 8 frames from each video, like previous methods [Cao et al., 2020; Wang et al., 2023b]. In the network architecture, the Encoder's Transformer layers are configured with four layers. The teacher and student models adopt the same structure and initialization parameters. During training, we resize each frame in the video to 256×256, followed by random horizontal flips and random cropping to a 224×224 region. In the testing phase, we first perform resizing and then replace random cropping with center cropping to standardize the shape of input videos of varying sizes. We utilize the Adam optimizer with an initial learning rate of 0.0005 to train our model. We randomly sample 30,000 episodes from the meta-training set for training. For testing, similar to prior work [Wang et al., 2023b], we collect 10,000 episodes from the meta-testing set to evaluate the model's performance and report the average accuracy.
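The quoted setup maps onto a fairly standard PyTorch/torchvision pipeline. The sketch below is a minimal illustration under that assumption, not the authors' code (none is released): names such as `build_transforms` and `sample_frame_indices`, and the mid-segment frame-sampling rule, are ours. The Transformer Encoder and the teacher/student self-distillation pair are omitted.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50

def build_transforms(train: bool) -> transforms.Compose:
    """Frame preprocessing as quoted: resize to 256x256, then random
    horizontal flip + random 224x224 crop for training, or a center
    224x224 crop for testing."""
    if train:
        return transforms.Compose([
            transforms.Resize((256, 256)),
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(224),
            transforms.ToTensor(),
        ])
    return transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

def sample_frame_indices(num_frames: int, num_segments: int = 8) -> list[int]:
    """Sparse, uniform sampling of 8 frames: one frame from the middle of
    each of 8 equal temporal segments (one common reading of the quote)."""
    seg = num_frames / num_segments
    return [int(seg * i + seg / 2) for i in range(num_segments)]

# ImageNet-pretrained ResNet50 used as a frame-level feature extractor
# (the classification head is replaced with an identity mapping).
backbone = resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()

# Adam with the quoted initial learning rate of 0.0005.
optimizer = torch.optim.Adam(backbone.parameters(), lr=5e-4)

# Episode budget from the quote: 30,000 meta-training episodes and
# 10,000 meta-testing episodes (accuracy averaged over the latter).
NUM_TRAIN_EPISODES, NUM_TEST_EPISODES = 30_000, 10_000
```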