Efficient Self-supervised Vision Transformers for Representation Learning

Authors: Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior arts with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. ... 4 EXPERIMENTAL RESULTS
Researcher Affiliation | Industry | 1Microsoft Research at Redmond, 2Microsoft Cloud + AI {chunyl,jianwyan,penzhan,xuga,bixi,xidai,luyuan,jfgao}@microsoft.com
Pseudocode | Yes | A.1 ALGORITHMS We summarize the training procedure of EsViT with L_V + L_R in Algorithm 1. To clearly outline the main idea of the algorithm, we show the algorithm for two augmented views. (A PyTorch-style sketch of this two-view step follows the table.)
Open Source Code | Yes | The code and pre-trained models are released at: https://github.com/microsoft/esvit
Open Datasets | Yes | We study unsupervised pre-training performed on the ImageNet-1K dataset (Deng et al., 2009) without labels.
Dataset Splits | Yes | We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10^-6 and 10^6, with 96 logarithmically spaced steps. To save compute required for the sweeps, we perform a parametric binary search that starts with λ ∈ {10^-6, 10^-4, 10^-2, 1, 10^2, 10^4, 10^6} and iteratively halves the interval around the peak until it reaches a resolution of 8 steps per decade. The hyperparameter sweeps are performed on a validation split of each dataset. (An illustrative sketch of this sweep follows the table.)
Hardware Specification | No | The paper mentions 'TPU years of training' and 'GPUs' (e.g., '625 TPU days, or 1.7 TPU years of training', '24 hours in 128 GPUs') but does not specify the exact models or configurations of these hardware components.
Software Dependencies | No | The paper mentions 'PyTorch-style pseudo-code', 'AdamW optimizer', and 'scikit-learn's L-BFGS implementation' but does not specify version numbers for these software components or libraries.
Experiment Setup | Yes | We train with the AdamW optimizer (Loshchilov & Hutter, 2018), a batch size of 512, and 300 total epochs. Linear warmup of the learning rate is used during the first 10 epochs, with its base value determined by the linear scaling rule (Goyal et al., 2017): lr = 0.0005 × batchsize/256. After this warmup, the learning rate is decayed with a cosine schedule. (A sketch of this optimizer and schedule setup follows the table.)
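
To make the Pseudocode row concrete, the following is a minimal PyTorch-style sketch of one EsViT training step for two augmented views, combining the view-level loss L_V with the region-level loss L_R in the spirit of Algorithm 1. The student/teacher return signatures, the temperature values, and all helper names are assumptions made for this sketch; teacher centering, the momentum (EMA) teacher update, and multi-crop augmentation are omitted.

```python
import torch
import torch.nn.functional as F

def esvit_two_view_step(student, teacher, x1, x2, temp_s=0.1, temp_t=0.04):
    """One EsViT-style step for two views (sketch; names and signatures assumed).

    Both networks are assumed to return a triple:
      view_logits:   [B, K]     projection of the view-level (global) token
      region_logits: [B, T, K]  projections of the T region (patch) tokens
      region_feats:  [B, T, D]  region features used for cross-view matching
    """
    with torch.no_grad():  # the teacher receives no gradients
        tv1, tr1, tf1 = teacher(x1)
        tv2, tr2, tf2 = teacher(x2)
    sv1, sr1, sf1 = student(x1)
    sv2, sr2, sf2 = student(x2)

    def ce(t_logits, s_logits):
        # Cross-entropy between the sharpened teacher distribution (temperature
        # temp_t) and the student distribution (temperature temp_s).
        t = F.softmax(t_logits / temp_t, dim=-1).detach()
        s = F.log_softmax(s_logits / temp_s, dim=-1)
        return -(t * s).sum(dim=-1).mean()

    # View-level loss L_V: each student view predicts the teacher's output
    # on the other view.
    loss_view = 0.5 * (ce(tv1, sv2) + ce(tv2, sv1))

    def region_loss(t_feats, t_logits, s_feats, s_logits):
        # For every student region, find its best-matching teacher region by
        # cosine similarity, then apply the same cross-entropy to that pair.
        sim = F.normalize(s_feats, dim=-1) @ F.normalize(t_feats, dim=-1).transpose(1, 2)
        match = sim.argmax(dim=-1)                                   # [B, T_s]
        matched = torch.gather(
            t_logits, 1, match.unsqueeze(-1).expand(-1, -1, t_logits.size(-1)))
        return ce(matched, s_logits)

    # Region-level loss L_R, again across the two views.
    loss_region = 0.5 * (region_loss(tf1, tr1, sf2, sr2) +
                         region_loss(tf2, tr2, sf1, sr1))
    return loss_view + loss_region
```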
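
The Dataset Splits row describes a coarse-to-fine search for the L2 regularization strength λ used in the linear-probe evaluation. Below is an illustrative sketch of that parametric search built on scikit-learn's L-BFGS-based LogisticRegression; the function and argument names are hypothetical, the features are assumed to be pre-extracted from the frozen backbone, and the released evaluation code may differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sweep_l2_strength(train_feats, train_labels, val_feats, val_labels,
                      start_grid=(1e-6, 1e-4, 1e-2, 1.0, 1e2, 1e4, 1e6),
                      steps_per_decade=8):
    """Coarse-to-fine search for lambda on a validation split (illustrative)."""
    def val_acc(lam):
        # scikit-learn's C is the inverse of the L2 regularization strength.
        clf = LogisticRegression(C=1.0 / lam, solver="lbfgs", max_iter=1000)
        clf.fit(train_feats, train_labels)
        return clf.score(val_feats, val_labels)

    grid = np.asarray(start_grid, dtype=float)
    while True:
        accs = [val_acc(lam) for lam in grid]
        best = int(np.argmax(accs))
        spacing = np.log10(grid[1]) - np.log10(grid[0])   # decades per grid step
        if spacing <= 1.0 / steps_per_decade:
            return grid[best]                             # fine enough: stop
        # Halve the search interval around the current peak and refine it with
        # a fresh logarithmically spaced grid.
        lo = grid[max(best - 1, 0)]
        hi = grid[min(best + 1, len(grid) - 1)]
        grid = np.logspace(np.log10(lo), np.log10(hi), num=len(grid))
```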
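
The Experiment Setup row specifies the optimizer and learning-rate schedule, which the sketch below mirrors: AdamW with the linear scaling rule lr = 0.0005 × batchsize/256, a 10-epoch linear warmup, and cosine decay. The `steps_per_epoch` default and the decay-to-zero floor are simplifying assumptions; the paper's weight-decay schedule and final learning-rate value are not reproduced here.

```python
import math
import torch

def build_optimizer_and_schedule(model, batch_size=512, total_epochs=300,
                                 warmup_epochs=10, steps_per_epoch=2503):
    """AdamW + linear warmup + cosine decay, following the quoted setup.

    steps_per_epoch defaults to roughly ImageNet-1K size / batch size 512;
    adjust it to the actual data loader length.
    """
    # Linear scaling rule: lr = 0.0005 * batchsize / 256.
    base_lr = 0.0005 * batch_size / 256
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warmup from 0 to the base learning rate over 10 epochs.
            return step / max(1, warmup_steps)
        # Cosine decay from the base learning rate towards 0 afterwards.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In this sketch both optimizer.step() and scheduler.step() would be called once per training iteration, so the warmup and cosine decay are resolved per step rather than per epoch.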