iBOT: Image BERT Pre-Training with Online Tokenizer

Authors: Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which helps the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
Researcher Affiliation | Collaboration | Jinghao Zhou1, Chen Wei2, Huiyu Wang2, Wei Shen3, Cihang Xie4, Alan Yuille2, Tao Kong1 (1ByteDance, 2Johns Hopkins University, 3Shanghai Jiao Tong University, 4UC Santa Cruz)
Pseudocode | Yes | Appendix A (Pseudocode), Algorithm 1: iBOT PyTorch-like pseudocode w/o multi-crop augmentation. (A hedged sketch of this step follows the table.)
Open Source Code | Yes | The code and models are publicly available at https://github.com/bytedance/ibot.
Open Datasets | Yes | We pre-train iBOT on ImageNet-1K (Deng et al., 2009) training set with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 1024. We also pre-train on ImageNet-22K training set with ViT-B/16 for 80 epochs and ViT-L/16 for 50 epochs.
Dataset Splits | Yes | For k-NN evaluation, we sweep over different numbers of nearest neighbors. For linear evaluation, we sweep over different learning rates. (A toy sweep sketch follows the table.)
Hardware Specification | Yes | All methods are trained on two 8-GPU V100 machines with a batch size of 1024.
Software Dependencies | No | The paper mentions the 'AdamW (Loshchilov & Hutter, 2019)' optimizer but does not specify software versions for any libraries, frameworks, or languages used.
Experiment Setup | Yes | We use the Vision Transformers (Dosovitskiy et al., 2021) and Swin Transformers (Liu et al., 2021b) with different amounts of parameters, ViT-S/16, ViT-B/16, ViT-L/16, and Swin-T/{7,14}, as the backbone f. ... We set the output dimension of the shared head to 8192. ... We by default pre-train iBOT on the ImageNet-1K (Deng et al., 2009) training set with the AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 1024. We pre-train iBOT with ViT-S/16 for 800 epochs, ViT-B/16 for 400 epochs, ViT-L/16 for 250 epochs, and Swin-T/{7,14} for 300 epochs. ... The learning rate is linearly ramped up during the first 10 epochs to its base value scaled with the total batch size: lr = 5e-4 × batch_size / 256. We use random MIM, with prediction ratio r set as 0 with a probability of 0.5 and uniformly sampled from the range [0.1, 0.5] with a probability of 0.5. (A sketch of the scaling rule and ratio sampling follows the table.)
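The Pseudocode row refers to the paper's Algorithm 1. Below is a minimal PyTorch-style sketch of that training step as reconstructed from the paper's description (cross-view self-distillation on the [CLS] token plus a MIM loss on masked patch tokens, with an EMA teacher acting as the online tokenizer). The function names, temperatures, and momentum value here are illustrative assumptions, not the authors' exact Algorithm 1.

```python
# Minimal sketch of one iBOT training step, reconstructed from the paper's
# description. `student`/`teacher` are assumed to return K-dim logits for the
# [CLS] token and for every patch token; masks are boolean (B, N) tensors.
import torch
import torch.nn.functional as F

K = 8192                 # output dimension of the shared head (from the paper)
tps, tpt = 0.1, 0.04     # student / teacher temperatures (assumed DINO-style)
m = 0.996                # EMA momentum for the teacher (assumed)
center = torch.zeros(1, K)  # running center for teacher logits (update omitted)

def H(s, t):
    # Cross-entropy between student logits s and centered, sharpened
    # teacher logits t (teacher side is detached: no gradient flows back).
    t = F.softmax((t - center) / tpt, dim=-1).detach()
    return -(t * F.log_softmax(s / tps, dim=-1)).sum(dim=-1).mean()

def ibot_step(student, teacher, u, v, mask_u, mask_v):
    # Student sees the *masked* views; the teacher sees the full views.
    s_cls_u, s_patch_u = student(u, mask=mask_u)
    s_cls_v, s_patch_v = student(v, mask=mask_v)
    with torch.no_grad():
        t_cls_u, t_patch_u = teacher(u)
        t_cls_v, t_patch_v = teacher(v)

    # L_[CLS]: cross-view self-distillation on the [CLS] token.
    loss_cls = (H(s_cls_u, t_cls_v) + H(s_cls_v, t_cls_u)) / 2

    # L_MIM: same-view distillation on the masked patch tokens only;
    # the teacher plays the role of the online tokenizer.
    loss_mim = (H(s_patch_u[mask_u], t_patch_u[mask_u])
                + H(s_patch_v[mask_v], t_patch_v[mask_v])) / 2
    return loss_cls + loss_mim

@torch.no_grad()
def ema_update(student, teacher):
    # After each optimizer step, the teacher tracks the student by EMA.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```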
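The Dataset Splits row mentions hyperparameter sweeps for the k-NN and linear evaluations. As a toy illustration only (the neighbor counts are guesses, and scikit-learn's plain k-NN stands in for the paper's actual protocol over frozen features), such a sweep could look like:

```python
# Hypothetical k-NN sweep over frozen backbone features; the ks values are
# assumed, not taken from the paper.
from sklearn.neighbors import KNeighborsClassifier

def knn_sweep(train_feats, train_labels, val_feats, val_labels,
              ks=(10, 20, 100, 200)):
    scores = {}
    for k in ks:
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(train_feats, train_labels)
        scores[k] = clf.score(val_feats, val_labels)
    best_k = max(scores, key=scores.get)  # keep the best-performing k
    return best_k, scores
```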
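The Experiment Setup row compresses two scheduling details that are easy to misread. Here is a small sketch assuming only what the quote states: a 10-epoch linear warmup to lr = 5e-4 × batch_size / 256, and the two-branch sampling of the MIM prediction ratio r. The function names are illustrative, and the schedule after warmup is not specified in the quoted text.

```python
import random

def scaled_base_lr(batch_size, base=5e-4):
    # Linear scaling rule from the setup: lr = 5e-4 * batch_size / 256.
    # With the paper's batch size of 1024 this gives 2e-3.
    return base * batch_size / 256

def warmup_lr(epoch, batch_size, warmup_epochs=10):
    # Linear ramp to the scaled base value over the first 10 epochs;
    # what happens after warmup is not covered by the quoted text.
    return scaled_base_lr(batch_size) * min(1.0, (epoch + 1) / warmup_epochs)

def sample_prediction_ratio():
    # Random MIM: r = 0 with probability 0.5, otherwise r ~ Uniform[0.1, 0.5].
    return 0.0 if random.random() < 0.5 else random.uniform(0.1, 0.5)
```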