Time Is MattEr: Temporal Self-supervision for Video Transformers

Authors: Sukmin Yun, Jaehyung Kim, Dongyoon Han, Hwanjun Song, Jung-Woo Ha, Jinwoo Shin

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of the proposed temporal self-supervised tasks, we incorporate our method with various Video Transformers and mainly evaluate on the Something-Something-v2 (SSv2) (Goyal et al., 2017) benchmark.
Researcher Affiliation | Collaboration | (1) School of Electrical Engineering, KAIST, South Korea; (2) NAVER AI Lab, South Korea; (3) Graduate School of AI, KAIST, South Korea.
Pseudocode | No | The paper includes mathematical equations and descriptions of algorithms, but no explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code is available at https://github.com/alinlab/temporal-selfsupervision.
Open Datasets | Yes | We use the SSv2 (Goyal et al., 2017) dataset and its temporal and static classes following the categorization of Sevilla-Lara et al. (2021)...
Dataset Splits | Yes | SSv2 is a challenging dataset that consists of 169k training videos and 25k validation videos over 174 classes; in particular, it contains a large proportion of temporal classes requiring temporal information to be recognized (Sevilla-Lara et al., 2021).
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for the experiments. It mentions the NAVER Smart Machine Learning (NSML) platform, but without hardware specifications.
Software Dependencies | No | The paper mentions the AdamW optimizer and RandAugment but does not specify version numbers for any software libraries or dependencies.
Experiment Setup | Yes | Specifically, we fine-tune all the models from ImageNet (Deng et al., 2009) pre-trained weights of ViT-B/16 (Dosovitskiy et al., 2021) for 35 training epochs with the AdamW optimizer (Loshchilov & Hutter, 2018), a learning rate of 0.0001, and a batch size of 64. For data augmentation, we follow the RandAugment (Cubuk et al., 2020) policy of Patrick et al. (2021). We use a spatial resolution of 224×224 with a patch size of 16×16, and eight-frame input videos under the same 1×16×16 tokenization method, including Motionformer. We set all the loss weights to be 1 (i.e., λ_order = λ_debias = λ_flow = 1) unless stated otherwise.
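
For reference, the experiment setup above can be summarized as a short configuration sketch. This is a minimal, hypothetical PyTorch snippet assuming torchvision >= 0.13; it uses torchvision's image ViT-B/16 only as a stand-in for the video backbone (the paper fine-tunes Video Transformers such as Motionformer from the same ImageNet pre-trained ViT-B/16 weights), and the loss-weight constants are placeholders mirroring the paper's λ_order, λ_debias, and λ_flow terms, not the authors' actual training code.

```python
# Minimal sketch of the reported fine-tuning setup (assumed PyTorch + torchvision >= 0.13).
from torch.optim import AdamW
from torchvision.models import vit_b_16, ViT_B_16_Weights

# ImageNet pre-trained ViT-B/16 backbone: 224x224 input, 16x16 patches.
# Used here as a stand-in for the video backbone described in the paper.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Optimizer and schedule-level hyperparameters from the experiment setup.
optimizer = AdamW(model.parameters(), lr=1e-4)
EPOCHS = 35
BATCH_SIZE = 64
NUM_FRAMES = 8              # eight-frame input videos
TOKENIZATION = (1, 16, 16)  # temporal x height x width tokenization

# All loss weights set to 1, i.e. the total objective has the form
#   L = L_task + 1.0 * L_order + 1.0 * L_debias + 1.0 * L_flow
LAMBDA_ORDER = LAMBDA_DEBIAS = LAMBDA_FLOW = 1.0
```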