No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
Authors: Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage. ... We perform extensive ablations to verify our frame-wise pretext task formulation and illustrate the importance of shortcut removal for temporal self-supervision. |
| Researcher Affiliation | Collaboration | Ishan Rajendrakumar Dave¹*, Simon Jenni², Mubarak Shah¹; ¹Center for Research in Computer Vision, University of Central Florida, USA; ²Adobe Research, USA |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Project Page: https://daveishan.github.io/nms-webpage. |
| Open Datasets | Yes | We use the following set of established video benchmarks in our experiments: UCF101 (Soomro, Zamir, and Shah 2012)...HMDB51 (Kuehne et al. 2011)...Kinetics400 (Carreira and Zisserman 2017)...Something-Something V2 (SSv2) (Goyal et al. 2017)...NTU60 (Shahroudy et al. 2016)...Charades (Sigurdsson et al. 2018)...Holistic Video Understanding (HVU) (Diba et al. 2020)...DAVIS-2017 (Pont-Tuset et al. 2017)...CASIA-B (Yu, Tan, and Tan 2006)...JHMDB Pose (Jhuang et al. 2013) |
| Dataset Splits | Yes | We perform our self-supervised pretraining on unlabelled videos of Kinetics400. ... Following prior works (Han, Xie, and Zisserman 2020a; Dave et al. 2022; Diba et al. 2021), the test set of each dataset is used as a query-set, and the training set is considered as a search-set. ... The test set includes 50 subjects each with 10 sequences. The test set is divided into two splits: gallery and probe sets. (A minimal sketch of this query/search-set retrieval protocol follows the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software like 'Video Transformer Network (VTN) (Neimark et al. 2021)' and 'Vision Transformer (ViT) (Dosovitskiy et al. 2020)' but does not specify version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | As inputs to our network, we feed 8 frames of resolution 224 × 224. During training, we use the common set of geometric augmentations (random crop, resize, flipping) and color jittering (random grayscale, color jittering, random erasing). ... Finally, we combine our frame-level temporal pretext tasks with the contrastive loss L_C (L_C1 + L_C2). To summarize, we optimize L_SSL = λ_O L_OFL + λ_T L_TSP + λ_C L_C, where λ_O, λ_T, and λ_C are loss weights. (A minimal sketch of this combined objective follows the table.) |
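
To make the combined objective quoted in the Experiment Setup row concrete, here is a minimal PyTorch-style sketch of weighting and summing the three loss terms. The function name, default weight values, and the assumption that each term is a precomputed scalar tensor are illustrative choices, not the authors' released code.

```python
# Minimal sketch (assumption, not the authors' implementation) of the combined
# self-supervised objective L_SSL = λ_O·L_OFL + λ_T·L_TSP + λ_C·L_C.
import torch

def combined_ssl_loss(loss_ofl: torch.Tensor,
                      loss_tsp: torch.Tensor,
                      loss_c1: torch.Tensor,
                      loss_c2: torch.Tensor,
                      lambda_o: float = 1.0,
                      lambda_t: float = 1.0,
                      lambda_c: float = 1.0) -> torch.Tensor:
    """Weighted sum of the two frame-level temporal pretext losses and the
    contrastive loss, where L_C is the sum of the two contrastive terms
    (L_C1 + L_C2) mentioned in the paper. All inputs are scalar tensors
    assumed to be computed elsewhere; the default weights are placeholders."""
    loss_c = loss_c1 + loss_c2  # L_C = L_C1 + L_C2
    return lambda_o * loss_ofl + lambda_t * loss_tsp + lambda_c * loss_c

# Example usage with placeholder scalar losses:
loss = combined_ssl_loss(torch.tensor(0.7), torch.tensor(0.4),
                         torch.tensor(1.2), torch.tensor(1.1))
```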
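
The Dataset Splits row describes a retrieval-style evaluation in which the test set serves as the query set and the training set as the search set. The sketch below shows one common way such a protocol is scored (cosine similarity, top-k class match); the function name, feature shapes, and metric details are assumptions for illustration and may differ from the paper's exact evaluation code.

```python
# Minimal sketch (assumption, not the authors' evaluation code) of a
# nearest-neighbour video retrieval protocol: test-set features are queries,
# training-set features form the search set.
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(query_feats: torch.Tensor,    # (Nq, D) test-set features
                            query_labels: torch.Tensor,   # (Nq,) class labels
                            search_feats: torch.Tensor,   # (Ns, D) train-set features
                            search_labels: torch.Tensor,  # (Ns,) class labels
                            k: int = 1) -> float:
    """Fraction of queries whose top-k nearest neighbours (by cosine
    similarity) in the search set contain at least one video of the
    same class as the query."""
    q = F.normalize(query_feats, dim=1)
    s = F.normalize(search_feats, dim=1)
    sims = q @ s.t()                    # (Nq, Ns) cosine similarities
    topk = sims.topk(k, dim=1).indices  # indices of the k nearest neighbours
    hits = (search_labels[topk] == query_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```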