Frequency Selective Augmentation for Video Representation Learning

Authors: Jinhyung Kim, Taeoh Kim, Minho Shim, Dongyoon Han, Dongyoon Wee, Junmo Kim

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Transferring the improved representation to five video action recognition and two temporal action localization downstream tasks shows consistent improvements over baselines. Empirical results show that FreqAug enhances the performance of multiple SSL frameworks and backbones, which implies the learned representation has significantly improved transferability. In Table 1, we present the evaluation results of MoCo with FreqAug pretrained on MK200. We validate on four different backbones: SlowOnly-50 (SO-50), SlowOnly-18 (SO-18), R(2+1)D, and S3D-G, which have various input resolutions (number of frames T, stride τ), depth, and network architecture.
Researcher Affiliation | Collaboration | Jinhyung Kim¹*, Taeoh Kim², Minho Shim², Dongyoon Han³, Dongyoon Wee², Junmo Kim⁴ (¹LG AI Research, ²NAVER CLOVA Video, ³NAVER AI Lab, ⁴KAIST)
Pseudocode | No | The paper describes its method verbally and with diagrams, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor does it include a link to a code repository for the methodology described.
Open Datasets | Yes | For pretraining the model, we use Kinetics-400 (K400) (Carreira and Zisserman 2017) and Mini-Kinetics (MK200) (Xie et al. 2018). For evaluation of the pretrained models, we use five different action recognition datasets: UCF101 (Soomro, Zamir, and Shah 2012), HMDB51 (Kuehne et al. 2011), Diving48 (DV48) (Li, Li, and Vasconcelos 2018), Gym99 (Shao et al. 2020), and Something-something-v2 (SSv2) (Goyal et al. 2017). For temporal action localization, the Breakfast (Kuehne, Arslan, and Serre 2014) and THUMOS14 (Idrees et al. 2017) datasets are used.
Dataset Splits | Yes | For evaluation of the pretrained models, we use five different action recognition datasets: UCF101 (Soomro, Zamir, and Shah 2012), HMDB51 (Kuehne et al. 2011), Diving48 (DV48) (Li, Li, and Vasconcelos 2018), Gym99 (Shao et al. 2020), and Something-something-v2 (SSv2) (Goyal et al. 2017). We present split-1 accuracy for UCF101 and HMDB51 by default unless otherwise specified. The samples are from the validation set of MK200, and a cutoff frequency of 0.2 is set for both HPF and LPF.
Hardware Specification | No | The computational work in this study was mostly conducted on NAVER Smart Machine Learning (NSML) platform (Sung et al. 2017; Kim et al. 2018).
Software Dependencies | No | The paper mentions implementing MoCo and BYOL and using PySlowFast, but it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | For self-supervised pretraining, all the models are trained with SGD for 200 epochs. Regarding spatial augmentation, the augmentations described in (Chen et al. 2020b) are applied as our baseline. For temporal augmentation, randomly sampled clips from different timestamps compose the positive instances, and the two clips are constrained to be sampled within a range of 1 second. Each clip consists of T frames sampled from T×τ consecutive frames with stride τ. In terms of FreqAug, we use the following two default settings: 1) FreqAug-T (temporal) uses temporal HPF with a cutoff frequency of 0.1; 2) FreqAug-ST (spatio-temporal) combines spatial HPF with a cutoff frequency of 0.01 alongside FreqAug-T. We train the models for 200 epochs with an initial learning rate of 0.025, without warm-up and with zeroed weight decay, for supervised finetuning and low-shot learning.
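
Since the paper is reported to provide neither pseudocode nor released code, the snippet below is only a minimal sketch of what the described temporal high-pass augmentation (FreqAug-T with cutoff 0.1) could look like; the function name temporal_high_pass, the FFT-based masking convention, and the tensor layout (C, T, H, W) are illustrative assumptions rather than the authors' implementation. The spatial HPF in FreqAug-ST would apply the same idea over the H and W axes (e.g., via torch.fft.fft2) with the stated cutoff of 0.01.

import torch

def temporal_high_pass(clip: torch.Tensor, cutoff: float = 0.1) -> torch.Tensor:
    # Hypothetical sketch: remove low temporal frequencies from a clip of shape (C, T, H, W).
    # `cutoff` is assumed to be a fraction of the normalized temporal frequency axis.
    t = clip.shape[1]
    spec = torch.fft.fft(clip, dim=1)                        # FFT along the time dimension
    freqs = torch.fft.fftfreq(t, device=clip.device).abs()   # normalized |frequency| per bin
    mask = (freqs > cutoff).to(clip.dtype).view(1, t, 1, 1)  # keep only high-frequency components
    return torch.fft.ifft(spec * mask, dim=1).real           # back to the pixel domain

# Usage: apply to one augmented view before feeding it to the SSL encoder.
clip = torch.randn(3, 8, 224, 224)            # one RGB clip, (C, T, H, W)
view = temporal_high_pass(clip, cutoff=0.1)   # FreqAug-T default cutoff from the paper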