No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

Authors: Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage. ... We perform extensive ablations to verify our frame-wise pretext task formulation and illustrate the importance of shortcut removal for temporal self-supervision.
Researcher Affiliation | Collaboration | Ishan Rajendrakumar Dave1*, Simon Jenni2, Mubarak Shah1; 1Center for Research in Computer Vision, University of Central Florida, USA; 2Adobe Research, USA
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Project Page: https://daveishan.github.io/nms-webpage.
Open Datasets | Yes | We use the following set of established video benchmarks in our experiments: UCF101 (Soomro, Zamir, and Shah 2012)...HMDB51 (Kuehne et al. 2011)...Kinetics400 (Carreira and Zisserman 2017)...Something-Something V2 (SSv2) (Goyal et al. 2017)...NTU60 (Shahroudy et al. 2016)...Charades (Sigurdsson et al. 2018)...Holistic Video Understanding (HVU) (Diba et al. 2020)...DAVIS-2017 (Pont-Tuset et al. 2017)...CASIA-B (Yu, Tan, and Tan 2006)...JHMDB Pose (Jhuang et al. 2013)
Dataset Splits | Yes | We perform our self-supervised pretraining on unlabelled videos of Kinetics400. ... Following prior works (Han, Xie, and Zisserman 2020a; Dave et al. 2022; Diba et al. 2021), the test set of each dataset is used as a query-set, and the training set is considered as a search-set (see the retrieval sketch after the table). ... The test set includes 50 subjects, each with 10 sequences. The test set is divided into two splits: a gallery set and a probe set.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions software such as the Video Transformer Network (VTN) (Neimark et al. 2021) and the Vision Transformer (ViT) (Dosovitskiy et al. 2020) but does not specify version numbers for these or for other software dependencies (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | As inputs to our network, we feed 8 frames of resolution 224 x 224. During training, we use the common set of geometric augmentations (random crop, resize, flipping) and color jittering (random grayscale, color jittering, random erasing). ... Finally, we combine our frame-level temporal pretext tasks with the contrastive loss L_C (L_C1 + L_C2). To summarize, we optimize L_SSL = λ_O·L_OFL + λ_T·L_TSP + λ_C·L_C, where λ_O, λ_T, and λ_C are loss weights (see the training sketch after the table).
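To make the retrieval protocol quoted under Dataset Splits concrete (test-set clips as the query-set, training-set clips as the search-set), here is a minimal sketch of nearest-neighbour video retrieval over precomputed clip features. The function name, the cosine-similarity choice, and the R@k metric are illustrative assumptions, not code from the paper or its project page.

```python
import numpy as np

def topk_retrieval_accuracy(query_feats, query_labels, search_feats, search_labels, ks=(1, 5, 10, 20)):
    """Cosine-similarity nearest-neighbour retrieval, reporting R@k."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    s = search_feats / np.linalg.norm(search_feats, axis=1, keepdims=True)
    sims = q @ s.T                                   # (num_queries, num_search)
    ranked = np.argsort(-sims, axis=1)               # nearest search clips first
    results = {}
    for k in ks:
        topk_labels = search_labels[ranked[:, :k]]   # class labels of the k nearest clips
        hits = (topk_labels == query_labels[:, None]).any(axis=1)
        results[f"R@{k}"] = float(hits.mean())
    return results

# Example usage with random placeholder features (512-d clip embeddings):
# query features come from the test set, search features from the training set.
rng = np.random.default_rng(0)
query, query_y = rng.normal(size=(100, 512)), rng.integers(0, 10, 100)
search, search_y = rng.normal(size=(1000, 512)), rng.integers(0, 10, 1000)
print(topk_retrieval_accuracy(query, query_y, search, search_y))
```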
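For the training setup quoted under Experiment Setup (8 frames at 224 x 224, the listed augmentations, and the weighted loss L_SSL = λ_O·L_OFL + λ_T·L_TSP + λ_C·L_C), the following is a hedged PyTorch/torchvision sketch. The transform pipeline, the loss-weight values, and the loss inputs are assumptions for illustration only; the authors' implementation is linked from the project page.

```python
import torch
from torchvision import transforms

# Per-frame augmentation pipeline matching the quoted list; in practice the same
# randomly sampled parameters are typically shared across all 8 frames of a clip.
frame_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),               # random crop + resize to 224 x 224
    transforms.RandomHorizontalFlip(),               # flipping
    transforms.RandomGrayscale(p=0.2),               # random grayscale
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),      # color jittering
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                # random erasing (applied on the tensor)
])

# Weighted combination L_SSL = λ_O·L_OFL + λ_T·L_TSP + λ_C·L_C.
# The weight values below are placeholders; they are not stated in the quoted text.
lambda_o, lambda_t, lambda_c = 1.0, 1.0, 1.0

def ssl_loss(loss_ofl: torch.Tensor, loss_tsp: torch.Tensor, loss_c: torch.Tensor) -> torch.Tensor:
    """Combine the frame-level temporal pretext losses with the contrastive loss."""
    return lambda_o * loss_ofl + lambda_t * loss_tsp + lambda_c * loss_c
```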