Self-supervised video pretraining yields robust and more human-aligned visual representations

Authors: Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier Hénaff

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present in Table 1 the transfer performance of VITO compared to strong supervised and self-supervised baselines on dense scene understanding (semantic segmentation and object detection), video understanding (video segmentation and action recognition), and out-of-distribution (OOD) object recognition.
Researcher Affiliation | Collaboration | Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff; Google DeepMind. Current affiliation: NYU Center for Neural Science; work done while interning at DeepMind.
Pseudocode | No | The paper describes the methodology using text and mathematical equations but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | No | The paper does not contain an explicit statement or a direct link indicating that the source code for the VITO methodology is publicly available.
Open Datasets | Yes | We present in Table 1 the transfer performance of VITO compared to strong supervised and self-supervised baselines on dense scene understanding (semantic segmentation and object detection), video understanding (video segmentation and action recognition), and out-of-distribution (OOD) object recognition. (Table 1 lists: ADE20K, COCO, DAVIS, UCF101, IN-A, IN-Vid.)
Dataset Splits | Yes | We fine-tune for 45 epochs on the PASCAL train_aug2012 set or 60 epochs on the ADE20K train set. We report mIoU on the val set averaged across 5 runs. (A sketch of this fine-tuning and evaluation protocol follows the table.)
Hardware Specification | Yes | We pretrain ResNet-50 using the LARS optimizer [78] with a batch size of 4096 split across 128 Cloud TPU v3 workers.
Software Dependencies | No | The paper refers to codebases used (e.g., 'the mmseg codebase' and https://github.com/open-mmlab/mmsegmentation) but does not specify particular software versions for libraries or frameworks such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We pretrain ResNet-50 using the LARS optimizer [78] with a batch size of 4096 split across 128 Cloud TPU v3 workers. We adopt the optimization details of BYOL, scaling the learning rate linearly with the batch size and decaying it according to a cosine schedule. The base learning rate is 0.3 and the weight decay is 10^-6. (A sketch of this learning-rate schedule follows the table.)
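
The Experiment Setup row describes BYOL-style optimization: the learning rate is scaled linearly with the batch size and decayed on a cosine schedule. Below is a minimal Python sketch of that schedule, assuming the usual BYOL convention of scaling relative to a reference batch size of 256; the function name, the reference batch size, and the optional warmup are assumptions, not details taken from the report.

```python
import math

def pretrain_learning_rate(step, total_steps, batch_size=4096,
                           base_lr=0.3, reference_batch=256,
                           warmup_steps=0):
    """Cosine-decayed learning rate with linear batch-size scaling (BYOL-style).

    base_lr=0.3 and batch_size=4096 come from the report; the reference
    batch of 256 and the warmup are assumptions for illustration.
    """
    peak_lr = base_lr * batch_size / reference_batch  # linear scaling
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# With batch_size=4096 the peak learning rate is 0.3 * 4096 / 256 = 4.8.
# The LARS optimizer and the 10^-6 weight decay quoted in the report would
# be applied on top of this schedule; they are not reimplemented here.
```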
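
The Dataset Splits row likewise implies a concrete fine-tuning and evaluation protocol: train on the published train split for a fixed number of epochs and report val-set mIoU averaged across 5 runs. The sketch below records that protocol as an illustrative config plus an averaging helper; the dictionary layout and the `finetune_and_eval` callable are hypothetical, not the paper's actual (mmsegmentation-based) configuration format.

```python
from statistics import mean, stdev

# Illustrative configs mirroring the Dataset Splits row; the layout is an
# assumption, not the paper's configuration format.
FINETUNE_CONFIGS = {
    "pascal_voc_2012": {"train_split": "train_aug2012", "epochs": 45,
                        "eval_split": "val", "metric": "mIoU"},
    "ade20k": {"train_split": "train", "epochs": 60,
               "eval_split": "val", "metric": "mIoU"},
}

def averaged_miou(finetune_and_eval, config, seeds=(0, 1, 2, 3, 4)):
    """Fine-tune once per seed and average val-set mIoU over 5 runs, as the
    Dataset Splits row describes. `finetune_and_eval` is a placeholder for
    an unspecified training-plus-evaluation routine."""
    scores = [finetune_and_eval(config, seed=s) for s in seeds]
    return mean(scores), stdev(scores)
```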