Learning Implicit Temporal Alignment for Few-shot Video Classification

Authors: Songyang Zhang, Jiale Zhou, Xuming He

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on two challenging benchmarks show that our method outperforms the prior arts by a sizable margin on Something-Something V2 and achieves competitive results on Kinetics. In this section, we conduct a series of experiments to validate the effectiveness of our method. Below we first give a brief introduction of the experimental configurations and report the quantitative results on the two benchmarks in Sec. 5.1. Then we conduct ablative experiments to show the efficacy of our model design in Sec. 5.2.
Researcher Affiliation | Academia | Songyang Zhang (1,2,4), Jiale Zhou (1), Xuming He (1,3); 1 ShanghaiTech University; 2 University of Chinese Academy of Sciences; 3 Shanghai Engineering Research Center of Intelligent Vision and Imaging; 4 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
Pseudocode | No | The paper describes the proposed method using natural language and mathematical equations, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and model are available: https://github.com/tonysy/PyAction
Open Datasets | Yes | Following previous works, we use Kinetics [Carreira and Zisserman, 2017b] and Something-Something V2 [Goyal et al., 2017] as the benchmarks.
Dataset Splits | Yes | For the Kinetics dataset, we follow the same split as CMN [Zhu and Yang, 2018], which samples 64 classes for meta-training, 12 classes for validation, and 24 classes for meta-testing. (An illustrative sketch of this class-level split follows the table.)
Hardware Specification | No | The paper describes the experimental setup and training procedures but does not provide specific details regarding the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper refers to using ResNet as an embedding network but does not list specific software dependencies with their version numbers (e.g., deep learning frameworks, libraries, or operating systems).
Experiment Setup | Yes | Experimental Configuration: We follow the same video preprocessing procedure as OTAM [Cao et al., 2020]. During training, we first resize each frame in the video to 256×256 and then randomly crop a 224×224 region from the video clip. For the Something-Something V2 dataset, as pointed out in [Cao et al., 2020], the dataset is sensitive to the concepts of left and right, hence we do not use horizontal flipping for this dataset. Following the experiment settings and learning schedule of [Zhu and Yang, 2018] and [Cao et al., 2020], we perform different C-way K-shot experiments on the two datasets and report results with 95% confidence intervals in the meta-test phase. Specifically, the final results are reported over 5 runs, and we randomly sample 20,000 episodes for each run. (An illustrative preprocessing and confidence-interval sketch follows the table.)
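The 64 / 12 / 24 class split noted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code, and the actual CMN split is a fixed published list of classes rather than a random draw; the function name, random shuffle, and placeholder class names below are assumptions for illustration only.

```python
import random

def split_classes(all_classes, n_train=64, n_val=12, n_test=24, seed=0):
    """Partition class labels into disjoint meta-train / meta-val / meta-test
    sets with the 64 / 12 / 24 sizes used for the few-shot Kinetics split.
    NOTE: the real CMN split is a fixed published class list; the random
    shuffle here is only a stand-in for illustration."""
    assert len(all_classes) >= n_train + n_val + n_test
    classes = list(all_classes)
    random.Random(seed).shuffle(classes)
    meta_train = classes[:n_train]
    meta_val = classes[n_train:n_train + n_val]
    meta_test = classes[n_train + n_val:n_train + n_val + n_test]
    return meta_train, meta_val, meta_test

# Example with placeholder class names:
# train_cls, val_cls, test_cls = split_classes([f"class_{i}" for i in range(100)])
```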
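The preprocessing and evaluation protocol in the Experiment Setup row can likewise be sketched as follows. This is a hypothetical torchvision-style pipeline, not the OTAM or PyAction implementation; the function names and the "ssv2" dataset tag are placeholders, and the confidence interval uses the usual 1.96 normal-approximation factor over run-level accuracies, which is one plausible reading of "reported over 5 runs".

```python
import math
import statistics

import torchvision.transforms as T

def build_frame_transform(dataset: str, train: bool = True) -> T.Compose:
    """Per-frame preprocessing: resize to 256x256, then a random 224x224 crop
    during training (center crop otherwise). Horizontal flipping is skipped
    for Something-Something V2 because its labels distinguish left from right."""
    ops = [T.Resize((256, 256))]
    if train:
        ops.append(T.RandomCrop(224))
        if dataset != "ssv2":  # flip only where left/right does not change the label
            ops.append(T.RandomHorizontalFlip())
    else:
        ops.append(T.CenterCrop(224))
    ops.append(T.ToTensor())
    return T.Compose(ops)

def mean_and_ci95(per_run_accuracies):
    """Mean accuracy and 95% confidence half-width over run-level results,
    e.g. 5 runs of 20,000 randomly sampled C-way K-shot episodes each."""
    n = len(per_run_accuracies)
    mean = statistics.mean(per_run_accuracies)
    half_width = 1.96 * statistics.stdev(per_run_accuracies) / math.sqrt(n)
    return mean, half_width
```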