What Do Self-Supervised Vision Transformers Learn?

Authors: Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, Sangdoo Yun

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance of downstream tasks."
Researcher Affiliation | Industry | Namuk Park (Prescient Design, Genentech); Wonjae Kim, Byeongho Heo, Taekyung Kim, Sangdoo Yun (NAVER AI Lab). Contact: park.namuk@gene.com, {wonjae.kim,bh.heo,taekyung.k,sangdoo.yun}@navercorp.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "The code for analysis is available at https://github.com/naver-ai/cl-vs-mim."
Open Datasets | Yes | "Our analyses mainly compare ViT-B/16 pre-trained on ImageNet-1K (Russakovsky et al., 2015)"
Dataset Splits | Yes | "We use the ImageNet validation images for our experiments." Table A.1 (training settings) also specifies batch size 1k and 50 training epochs.
Hardware Specification | Yes | "All experiments use {1, 4, 8} NVIDIA A100 Tensor Core GPUs."
Software Dependencies | No | "Neural network models are implemented in PyTorch (Paszke et al., 2019)."
Experiment Setup | Yes | Table A.1: Training settings, covering three training configurations:

Setting | Config 1 | Config 2 | Config 3
optimizer | sgd | adamw | adamw
base learning rate | 1.0e-0 | 1.25e-3 | 1.0e-4
weight decay | 0.05 | 0.05 | 0.05
batch size | 1k | 2k | 1k
training epochs | 50 | 100 | 100
learning rate schedule | cosine | cosine | multistep
warmup epochs | 0 | 20 | 10
warmup schedule | – | linear | linear
randaugment | – | 9, 0.5 | 9, 0.5
label smoothing | – | 0.1 | 0.1
mixup | – | 0.8 | 0.8
cutmix | – | 1.0 | 1.0
stochastic depth | – | 0.1 | 0.1
layer decay | – | 0.65 | 1.0
gradient clip | – | 5.0 | 5.0
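The AdamW recipe can be expressed directly in PyTorch. Below is a minimal sketch assuming Config 2 above is an end-to-end fine-tuning recipe (base LR 1.25e-3, weight decay 0.05, 20 linear warmup epochs, cosine decay over 100 epochs); the model is a placeholder, and layer-wise LR decay (0.65) and the augmentations (RandAugment, mixup, CutMix, stochastic depth) are omitted, so this illustrates the schedule rather than reproducing the authors' training code.

```python
import math
import torch

# Hypothetical stand-in for the ViT-B/16 backbone used in the paper.
model = torch.nn.Linear(768, 1000)

base_lr = 1.25e-3      # Config 2 of Table A.1
weight_decay = 0.05
warmup_epochs = 20
total_epochs = 100

optimizer = torch.optim.AdamW(
    model.parameters(), lr=base_lr, weight_decay=weight_decay
)

def lr_lambda(epoch: int) -> float:
    # Linear warmup for the first 20 epochs, then cosine decay to zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... one pass over ImageNet-1K would go here (omitted) ...
    optimizer.step()   # placeholder; real training steps once per batch
    scheduler.step()   # the schedule advances once per epoch
```

The gradient clip of 5.0 listed in Table A.1 would be applied per batch, e.g. with torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0) between the backward pass and optimizer.step().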