Efficient Self-supervised Vision Transformers for Representation Learning
Authors: Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that, combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. ... 4 EXPERIMENTAL RESULTS |
| Researcher Affiliation | Industry | ¹Microsoft Research at Redmond, ²Microsoft Cloud + AI {chunyl,jianwyan,penzhan,xuga,bixi,xidai,luyuan,jfgao}@microsoft.com |
| Pseudocode | Yes | A.1 ALGORITHMS We summarize the training algorithm procedure of EsViT with L_V + L_R in Algorithm 1. To clearly outline the main idea of the algorithm, we show the algorithm for two augmented views. (A hedged PyTorch-style sketch of this objective follows the table.) |
| Open Source Code | Yes | The code and pre-trained models are released at: https://github.com/microsoft/esvit |
| Open Datasets | Yes | We study unsupervised pre-training performed in ImageNet-1K dataset (Deng et al., 2009) without labels. |
| Dataset Splits | Yes | We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10^-6 and 10^6, with 96 logarithmically spaced steps. To save compute required for the sweeps, we perform a parametric binary search that starts with λ = [10^-6, 10^-4, 10^-2, 1, 10^2, 10^4, 10^6] and iteratively halves the interval around the peak until it reaches a resolution of 8 steps per decade. The hyperparameter sweeps are performed on a validation split of each dataset. (A sketch of this sweep follows the table.) |
| Hardware Specification | No | The paper mentions 'TPU years of training' and 'GPUs' (e.g., '625 TPU days, or 1.7 TPU years of training', '24 hours in 128 GPUs') but does not specify the exact models or configurations of these hardware components. |
| Software Dependencies | No | The paper mentions 'PyTorch-style pseudo-code', 'AdamW optimizer', and 'scikit-learn's L-BFGS implementation' but does not specify version numbers for these software components or libraries. |
| Experiment Setup | Yes | We train with the AdamW optimizer (Loshchilov & Hutter, 2018), a batch size of 512, and 300 total epochs. Linear warmup of the learning rate is used during the first 10 epochs, with its base value determined with the linear scaling rule (Goyal et al., 2017): lr = 0.0005 × batchsize/256. After this warmup, the learning rate is decayed with a cosine schedule. (A sketch of this schedule follows the table.) |
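
For concreteness, here is a minimal PyTorch-style sketch of the combined view-level (L_V) and region-level (L_R) cross-entropy losses that Algorithm 1 summarizes, written for a single pair of augmented views. This is not the released implementation: the function names, tensor shapes, temperature, and the matching direction (each student region matched to its most similar teacher region) are assumptions, and the teacher centering/sharpening and multi-crop handling are omitted.

```python
import torch
import torch.nn.functional as F

def view_loss(student_logits, teacher_probs, temp_s=0.1):
    # DINO-style cross-entropy between the teacher's view-level distribution
    # (already softmaxed upstream) and the student's view-level prediction.
    log_p_s = F.log_softmax(student_logits / temp_s, dim=-1)
    return -(teacher_probs * log_p_s).sum(dim=-1).mean()

def region_loss(student_logits, student_feats, teacher_probs, teacher_feats, temp_s=0.1):
    # Match each student region to its most similar teacher region by cosine
    # similarity, then use the matched teacher distribution as the soft target.
    # Shapes (assumed): *_logits / *_probs are (B, T, K); *_feats are (B, T, D).
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    sim = torch.bmm(s, t.transpose(1, 2))                       # (B, T_s, T_t)
    match = sim.argmax(dim=-1)                                   # (B, T_s)
    idx = match.unsqueeze(-1).expand(-1, -1, teacher_probs.size(-1))
    targets = torch.gather(teacher_probs, 1, idx)                # (B, T_s, K)
    log_p_s = F.log_softmax(student_logits / temp_s, dim=-1)
    return -(targets * log_p_s).sum(dim=-1).mean()

# Total objective for one (student view, teacher view) pair:
#   loss = view_loss(...) + region_loss(...)
```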
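The 'Dataset Splits' row describes a parametric binary search over the L2 regularization strength λ used for linear-probe evaluation. The sketch below is one plausible reading of that procedure, assuming a hypothetical `val_accuracy` callable that fits a linear classifier at a given log10(λ) and returns validation accuracy; the refinement step (probing half-step neighbors of the current peak) is our interpretation of "iteratively halves the interval around the peak". Note that 8 steps per decade over the 12-decade range corresponds to the quoted 96 logarithmically spaced steps.

```python
import numpy as np

def sweep_l2_strength(val_accuracy, lo_exp=-6.0, hi_exp=6.0,
                      coarse_step=2.0, final_step=1.0 / 8):
    # Coarse grid over log10(lambda): [-6, -4, -2, 0, 2, 4, 6],
    # i.e. lambda in [1e-6, 1e-4, 1e-2, 1, 1e2, 1e4, 1e6].
    step = coarse_step
    exps = np.arange(lo_exp, hi_exp + 1e-9, step)
    scores = {e: val_accuracy(e) for e in exps}
    best = max(scores, key=scores.get)
    # Halve the step around the current peak until the resolution
    # reaches 1/8 of a decade (8 steps per decade).
    while step > final_step:
        step /= 2.0
        for e in (best - step, best + step):
            if lo_exp <= e <= hi_exp and e not in scores:
                scores[e] = val_accuracy(e)
        best = max(scores, key=scores.get)
    return 10.0 ** best
```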
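The 'Experiment Setup' row fully specifies the learning-rate schedule, so it can be made concrete with a small helper. The sketch below assumes a per-step schedule and a final learning-rate floor (`final_lr`) that the quoted text does not state; with a batch size of 512, the linear scaling rule gives a peak learning rate of 0.0005 × 512/256 = 0.001.

```python
import math

def esvit_style_lr(step, total_steps, steps_per_epoch, batch_size,
                   base_lr=0.0005, warmup_epochs=10, final_lr=1e-6):
    # Peak LR via the linear scaling rule (Goyal et al., 2017): 0.0005 * batchsize / 256.
    peak = base_lr * batch_size / 256.0
    # Linear warmup over the first 10 epochs.
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    # Cosine decay from the peak; the `final_lr` floor is an assumption,
    # since the quoted setup does not state an end value.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak - final_lr) * (1.0 + math.cos(math.pi * progress))

# Example: batch size 512 yields a peak LR of 0.001; pair this schedule with
# torch.optim.AdamW over 300 epochs, as quoted above.
```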