Efficient Self-supervised Vision Transformers for Representation Learning

Authors: Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior arts with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. ... 4 EXPERIMENTAL RESULTS
Researcher Affiliation | Industry | 1Microsoft Research at Redmond, 2Microsoft Cloud + AI {chunyl,jianwyan,penzhan,xuga,bixi,xidai,luyuan,jfgao}@microsoft.com
Pseudocode | Yes | A.1 ALGORITHMS We summarize the training procedure of EsViT with L_V + L_R in Algorithm 1. To clearly outline the main idea of the algorithm, we show the algorithm for two augmented views. (A PyTorch-style sketch of this two-view step follows the table.)
Open Source Code | Yes | The code and pre-trained models are released at: https://github.com/microsoft/esvit
Open Datasets | Yes | We study unsupervised pre-training performed on the ImageNet-1K dataset (Deng et al., 2009) without labels.
Dataset Splits | Yes | We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10^-6 and 10^6, with 96 logarithmically spaced steps. To save compute required for the sweeps, we perform a parametric binary search that starts with λ ∈ {10^-6, 10^-4, 10^-2, 1, 10^2, 10^4, 10^6} and iteratively halves the interval around the peak until it reaches a resolution of 8 steps per decade. The hyperparameter sweeps are performed on a validation split of each dataset. (An illustrative sketch of this sweep follows the table.)
Hardware Specification | No | The paper mentions 'TPU years of training' and 'GPUs' (e.g., '625 TPU days, or 1.7 TPU years of training', '24 hours in 128 GPUs') but does not specify the exact models or configurations of these hardware components.
Software Dependencies | No | The paper mentions 'PyTorch-style pseudo-code', 'AdamW optimizer', and 'scikit-learn's L-BFGS implementation' but does not specify version numbers for these software components or libraries.
Experiment Setup | Yes | We train with the AdamW optimizer (Loshchilov & Hutter, 2018), a batch size of 512, and 300 total epochs. Linear warmup of the learning rate is used during the first 10 epochs, with its base value determined by the linear scaling rule (Goyal et al., 2017): lr = 0.0005 × batchsize/256. After this warmup, the learning rate is decayed with a cosine schedule. (A sketch of this optimizer and schedule setup follows the table.)
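
To make the Pseudocode row concrete, the following is a minimal PyTorch-style sketch of one EsViT training step for two augmented views, combining the view-level loss L_V with the region-level loss L_R in the spirit of Algorithm 1. The student/teacher return signatures, the temperature values, and all helper names are assumptions made for this sketch; teacher centering, the momentum (EMA) teacher update, and multi-crop augmentation are omitted.

```python
import torch
import torch.nn.functional as F

def esvit_two_view_step(student, teacher, x1, x2, temp_s=0.1, temp_t=0.04):
    """One EsViT-style step for two views (sketch; names and signatures assumed).

    Both networks are assumed to return a triple:
      view_logits:   [B, K]     projection of the view-level (global) token
      region_logits: [B, T, K]  projections of the T region (patch) tokens
      region_feats:  [B, T, D]  region features used for cross-view matching
    """
    with torch.no_grad():  # the teacher receives no gradients
        tv1, tr1, tf1 = teacher(x1)
        tv2, tr2, tf2 = teacher(x2)
    sv1, sr1, sf1 = student(x1)
    sv2, sr2, sf2 = student(x2)

    def ce(t_logits, s_logits):
        # Cross-entropy between the sharpened teacher distribution (temperature
        # temp_t) and the student distribution (temperature temp_s).
        t = F.softmax(t_logits / temp_t, dim=-1).detach()
        s = F.log_softmax(s_logits / temp_s, dim=-1)
        return -(t * s).sum(dim=-1).mean()

    # View-level loss L_V: each student view predicts the teacher's output
    # on the other view.
    loss_view = 0.5 * (ce(tv1, sv2) + ce(tv2, sv1))

    def region_loss(t_feats, t_logits, s_feats, s_logits):
        # For every student region, find its best-matching teacher region by
        # cosine similarity, then apply the same cross-entropy to that pair.
        sim = F.normalize(s_feats, dim=-1) @ F.normalize(t_feats, dim=-1).transpose(1, 2)
        match = sim.argmax(dim=-1)                                   # [B, T_s]
        matched = torch.gather(
            t_logits, 1, match.unsqueeze(-1).expand(-1, -1, t_logits.size(-1)))
        return ce(matched, s_logits)

    # Region-level loss L_R, again across the two views.
    loss_region = 0.5 * (region_loss(tf1, tr1, sf2, sr2) +
                         region_loss(tf2, tr2, sf1, sr1))
    return loss_view + loss_region
```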
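
The Dataset Splits row describes a coarse-to-fine search for the L2 regularization strength λ used in the linear-probe evaluation. Below is an illustrative sketch of that parametric search built on scikit-learn's L-BFGS-based LogisticRegression; the function and argument names are hypothetical, the features are assumed to be pre-extracted from the frozen backbone, and the released evaluation code may differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sweep_l2_strength(train_feats, train_labels, val_feats, val_labels,
                      start_grid=(1e-6, 1e-4, 1e-2, 1.0, 1e2, 1e4, 1e6),
                      steps_per_decade=8):
    """Coarse-to-fine search for lambda on a validation split (illustrative)."""
    def val_acc(lam):
        # scikit-learn's C is the inverse of the L2 regularization strength.
        clf = LogisticRegression(C=1.0 / lam, solver="lbfgs", max_iter=1000)
        clf.fit(train_feats, train_labels)
        return clf.score(val_feats, val_labels)

    grid = np.asarray(start_grid, dtype=float)
    while True:
        accs = [val_acc(lam) for lam in grid]
        best = int(np.argmax(accs))
        spacing = np.log10(grid[1]) - np.log10(grid[0])   # decades per grid step
        if spacing <= 1.0 / steps_per_decade:
            return grid[best]                             # fine enough: stop
        # Halve the search interval around the current peak and refine it with
        # a fresh logarithmically spaced grid.
        lo = grid[max(best - 1, 0)]
        hi = grid[min(best + 1, len(grid) - 1)]
        grid = np.logspace(np.log10(lo), np.log10(hi), num=len(grid))
```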
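
The Experiment Setup row specifies the optimizer and learning-rate schedule, which the sketch below mirrors: AdamW with the linear scaling rule lr = 0.0005 × batchsize/256, a 10-epoch linear warmup, and cosine decay. The `steps_per_epoch` default and the decay-to-zero floor are simplifying assumptions; the paper's weight-decay schedule and final learning-rate value are not reproduced here.

```python
import math
import torch

def build_optimizer_and_schedule(model, batch_size=512, total_epochs=300,
                                 warmup_epochs=10, steps_per_epoch=2503):
    """AdamW + linear warmup + cosine decay, following the quoted setup.

    steps_per_epoch defaults to roughly ImageNet-1K size / batch size 512;
    adjust it to the actual data loader length.
    """
    # Linear scaling rule: lr = 0.0005 * batchsize / 256.
    base_lr = 0.0005 * batch_size / 256
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warmup from 0 to the base learning rate over 10 epochs.
            return step / max(1, warmup_steps)
        # Cosine decay from the base learning rate towards 0 afterwards.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In this sketch both optimizer.step() and scheduler.step() would be called once per training iteration, so the warmup and cosine decay are resolved per step rather than per epoch.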