No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

Authors: Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage. ... We perform extensive ablations to verify our frame-wise pretext task formulation and illustrate the importance of shortcut removal for temporal self-supervision.
Researcher Affiliation | Collaboration | Ishan Rajendrakumar Dave1*, Simon Jenni2, Mubarak Shah1; 1Center for Research in Computer Vision, University of Central Florida, USA; 2Adobe Research, USA
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Project Page: https://daveishan.github.io/nms-webpage.
Open Datasets | Yes | We use the following set of established video benchmarks in our experiments: UCF101 (Soomro, Zamir, and Shah 2012)...HMDB51 (Kuehne et al. 2011)...Kinetics400 (Carreira and Zisserman 2017)...Something-Something V2 (SSv2) (Goyal et al. 2017)...NTU60 (Shahroudy et al. 2016)...Charades (Sigurdsson et al. 2018)...Holistic Video Understanding (HVU) (Diba et al. 2020)...DAVIS-2017 (Pont-Tuset et al. 2017)...CASIA-B (Yu, Tan, and Tan 2006)...JHMDB Pose (Jhuang et al. 2013)
Dataset Splits | Yes | We perform our self-supervised pretraining on unlabelled videos of Kinetics400. ... Following prior works (Han, Xie, and Zisserman 2020a; Dave et al. 2022; Diba et al. 2021), the test set of each dataset is used as a query-set, and the training set is considered as a search-set (see the retrieval sketch after the table). ... The test set includes 50 subjects, each with 10 sequences. The test set is divided into two splits: a gallery set and a probe set.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions software such as the Video Transformer Network (VTN) (Neimark et al. 2021) and the Vision Transformer (ViT) (Dosovitskiy et al. 2020) but does not specify version numbers for these or for other software dependencies (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | As inputs to our network, we feed 8 frames of resolution 224 x 224. During training, we use the common set of geometric augmentations (random crop, resize, flipping) and color jittering (random grayscale, color jittering, random erasing). ... Finally, we combine our frame-level temporal pretext tasks with the contrastive loss L_C (L_C1 + L_C2). To summarize, we optimize L_SSL = λ_O·L_OFL + λ_T·L_TSP + λ_C·L_C, where λ_O, λ_T, and λ_C are loss weights (see the training sketch after the table).
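To make the retrieval protocol quoted under Dataset Splits concrete (test-set clips as the query-set, training-set clips as the search-set), here is a minimal sketch of nearest-neighbour video retrieval over precomputed clip features. The function name, the cosine-similarity choice, and the R@k metric are illustrative assumptions, not code from the paper or its project page.

```python
import numpy as np

def topk_retrieval_accuracy(query_feats, query_labels, search_feats, search_labels, ks=(1, 5, 10, 20)):
    """Cosine-similarity nearest-neighbour retrieval, reporting R@k."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    s = search_feats / np.linalg.norm(search_feats, axis=1, keepdims=True)
    sims = q @ s.T                                   # (num_queries, num_search)
    ranked = np.argsort(-sims, axis=1)               # nearest search clips first
    results = {}
    for k in ks:
        topk_labels = search_labels[ranked[:, :k]]   # class labels of the k nearest clips
        hits = (topk_labels == query_labels[:, None]).any(axis=1)
        results[f"R@{k}"] = float(hits.mean())
    return results

# Example usage with random placeholder features (512-d clip embeddings):
# query features come from the test set, search features from the training set.
rng = np.random.default_rng(0)
query, query_y = rng.normal(size=(100, 512)), rng.integers(0, 10, 100)
search, search_y = rng.normal(size=(1000, 512)), rng.integers(0, 10, 1000)
print(topk_retrieval_accuracy(query, query_y, search, search_y))
```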
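For the training setup quoted under Experiment Setup (8 frames at 224 x 224, the listed augmentations, and the weighted loss L_SSL = λ_O·L_OFL + λ_T·L_TSP + λ_C·L_C), the following is a hedged PyTorch/torchvision sketch. The transform pipeline, the loss-weight values, and the loss inputs are assumptions for illustration only; the authors' implementation is linked from the project page.

```python
import torch
from torchvision import transforms

# Per-frame augmentation pipeline matching the quoted list; in practice the same
# randomly sampled parameters are typically shared across all 8 frames of a clip.
frame_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),               # random crop + resize to 224 x 224
    transforms.RandomHorizontalFlip(),               # flipping
    transforms.RandomGrayscale(p=0.2),               # random grayscale
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),      # color jittering
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                # random erasing (applied on the tensor)
])

# Weighted combination L_SSL = λ_O·L_OFL + λ_T·L_TSP + λ_C·L_C.
# The weight values below are placeholders; they are not stated in the quoted text.
lambda_o, lambda_t, lambda_c = 1.0, 1.0, 1.0

def ssl_loss(loss_ofl: torch.Tensor, loss_tsp: torch.Tensor, loss_c: torch.Tensor) -> torch.Tensor:
    """Combine the frame-level temporal pretext losses with the contrastive loss."""
    return lambda_o * loss_ofl + lambda_t * loss_tsp + lambda_c * loss_c
```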