Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts
Authors: Pritam Sarkar, Ahmad Beirami, Ali Etemad
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we comprehensively study the behavior of six popular self-supervised methods (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various forms of natural distribution shift... To perform this extensive study, we carefully craft a test bed consisting of 17 in-distribution and out-of-distribution benchmark pairs using available public datasets and a series of evaluation protocols to stress-test the different methods under the intended shifts. Our study uncovers a series of intriguing findings and interesting behaviors of VSSL methods. |
| Researcher Affiliation | Collaboration | Pritam Sarkar (1,2), Ahmad Beirami (3), Ali Etemad (1,3); 1: Queen's University, 2: Vector Institute, 3: Google Research |
| Pseudocode | No | Not found. The paper describes the methods in text format, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like procedures. |
| Open Source Code | Yes | The project page and code are available at: https://pritamqu.github.io/OOD-VSSL. |
| Open Datasets | Yes | We use two large-scale video action datasets, Kinetics400 [61] and Kinetics700 [62], for pretraining the video models. The results presented in the main paper use Kinetics400 for pretraining, while we provide additional results based on pretraining with Kinetics700 in Appendix D. To evaluate the video models under distribution shifts, we use a total of 12 real-world benchmarks, comprising: Mimetics10 and Mimetics50 [28] as the out-of-context validation sets; CharadesEgo [36], TinyVirat-v2 [63], and Sims4Action [64] to investigate viewpoint shifts (egocentric, surveillance camera, and top-down views, respectively); ActorShift [65] and Sims4Action [64] for actor shifts (animal and synthetic domains, respectively); UCF101 [66] and HMDB51 [67] for source shift; UCF101 [66], HMDB51 [67], and RareAct [68] for zero-shot recognition; UCF101 and HMDB51 for open-set recognition while using Kinetics400 and UCF101 as closed-set. For each OoD validation set, we create an InD training and validation set to measure the change in performance. We construct the InD splits using Kinetics400 [61], Kinetics700 [62], MiT-v2 [69], and CharadesEgo [36]. Finally, we also use 3 toy datasets to conduct experiments in controlled setups, including ToyBox [70], COIL [34], and STL-10 [71]. Additional details of the benchmarks can be found in Appendix C. |
| Dataset Splits | Yes | To study input-based distribution shifts, we perform linear evaluation and finetuning using the InD training splits, followed by evaluating on both InD and OoD validation splits. To perform this investigation, we design a comprehensive study consisting of 17 in-distribution and out-of-distribution (InD-OoD) dataset pairs and examine the dynamics of VSSL under different distribution shifts using a variety of evaluation protocols, including linear evaluation, finetuning, unsupervised clustering, and zero-shot recognition. Table S5: An overview of our out-of-distribution test bed. #Samples are listed in the following order: training samples/InD test samples/OoD test samples. In zero-shot and open-set recognition, #Classes indicates the number of InD/OoD classes; for the others, the number of InD and OoD classes remains the same. (A sketch of the linear-evaluation protocol appears below the table.) |
| Hardware Specification | Yes | All the methods are pretrained using 8 V100 32 GB GPUs in parallel. |
| Software Dependencies | No | Not found. The paper mentions using the AdamW optimizer [98, 99], but it does not specify version numbers for programming languages (e.g., Python), major deep learning frameworks (e.g., PyTorch, TensorFlow), or any other software libraries required to replicate the experiments. |
| Experiment Setup | Yes | To ensure a fair comparison between the VSSL methods, we pretrain them in identical setups with necessary adjustments in hyperparameters. Specifically, we keep the encoder, inputs, batch size, and maximum number of pretraining epochs identical for all the methods. We use the AdamW [98, 99] optimizer with a cosine learning rate scheduler and train all methods for up to 800 epochs with a batch size of 768. We present the hyperparameters in Table S3. (A sketch of this optimizer setup appears below the table.) |
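
The experiment-setup row quotes a shared pretraining recipe: AdamW with a cosine learning-rate schedule, up to 800 epochs, and a batch size of 768. Below is a minimal PyTorch sketch of just that optimizer/scheduler scaffolding. The encoder, base learning rate, and weight decay are placeholders, since the actual values are given in the paper's Table S3, and the method-specific VSSL objective is elided.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder encoder; the paper keeps one identical video backbone for
# all six VSSL methods, but its architecture is specified in Table S3.
encoder = torch.nn.Linear(128, 512)

EPOCHS = 800         # maximum pretraining epochs (stated in the paper)
BATCH_SIZE = 768     # identical across all methods (stated in the paper)
BASE_LR = 1e-3       # assumption: the real value is in Table S3
WEIGHT_DECAY = 0.05  # assumption: the real value is in Table S3

optimizer = AdamW(encoder.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # One pass over Kinetics400 with the method-specific VSSL objective
    # (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, or v-MAE) goes here;
    # optimizer.step() would normally run once per batch.
    scheduler.step()  # cosine decay of the learning rate once per epoch
```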
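
The dataset-splits row describes the core InD/OoD measurement: train on the InD training split, then evaluate on the paired InD and OoD validation splits. The sketch below illustrates the linear-evaluation variant of that protocol under assumed loader names, probe hyperparameters, and a frozen encoder; it is an illustration of the described procedure, not the authors' released code (which is linked above).

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, train_loader, ind_val_loader, ood_val_loader,
                      feat_dim, num_classes, epochs=20, lr=1e-3):
    """Train a linear probe on the InD training split with the encoder
    frozen, then report accuracy on the paired InD and OoD validation
    splits. Loader names, epochs, and lr are illustrative assumptions."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False

    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for clips, labels in train_loader:
            with torch.no_grad():
                feats = encoder(clips)  # frozen features
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    def accuracy(loader):
        correct = total = 0
        with torch.no_grad():
            for clips, labels in loader:
                preds = probe(encoder(clips)).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        return correct / total

    # The drop from InD to OoD accuracy measures sensitivity to the shift.
    return accuracy(ind_val_loader), accuracy(ood_val_loader)
```

The gap between the two returned accuracies is the per-pair quantity the paper tracks across its 17 InD-OoD benchmark pairs.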