Image BERT Pre-training with Online Tokenizer
Authors: Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which helps the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation. |
| Researcher Affiliation | Collaboration | Jinghao Zhou1 Chen Wei2 Huiyu Wang2 Wei Shen3 Cihang Xie4 Alan Yuille2 Tao Kong1 1ByteDance 2Johns Hopkins University 3Shanghai Jiao Tong University 4UC Santa Cruz |
| Pseudocode | Yes | Appendix A Pseudocode, Algorithm 1: iBOT PyTorch-like Pseudocode w/o multi-crop augmentation (a minimal sketch in this spirit follows the table) |
| Open Source Code | Yes | The code and models are publicly available at https://github.com/bytedance/ibot. |
| Open Datasets | Yes | We pre-train iBOT on ImageNet-1K (Deng et al., 2009) training set with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 1024. We also pre-train on ImageNet-22K training set with ViT-B/16 for 80 epochs and ViT-L/16 for 50 epochs. |
| Dataset Splits | Yes | For k-NN evaluation, we sweep over different numbers of nearest neighbors. For linear evaluation, we sweep over different learning rates. |
| Hardware Specification | Yes | All methods are trained on two 8-GPU V100 machines with a batch size of 1024. |
| Software Dependencies | No | The paper mentions 'AdamW (Loshchilov & Hutter, 2019) optimizer' but does not specify software versions for any libraries, frameworks, or languages used. |
| Experiment Setup | Yes | We use the Vision Transformers (Dosovitskiy et al., 2021) and Swin Transformers (Liu et al., 2021b) with different amounts of parameters, ViT-S/16, ViT-B/16, ViT-L/16, and Swin-T/{7,14}, as the backbone f. ... We set the output dimension of the shared head to 8192. ... We by default pre-train iBOT on ImageNet-1K (Deng et al., 2009) training set with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 1024. We pre-train iBOT with ViT-S/16 for 800 epochs, ViT-B/16 for 400 epochs, ViT-L/16 for 250 epochs, and Swin-T/{7,14} for 300 epochs. ... The learning rate is linearly ramped up during the first 10 epochs to its base value scaled with the total batch size: lr = 5e-4 × batch_size / 256. We use random MIM, with prediction ratio r set as 0 with a probability of 0.5 and uniformly sampled from range [0.1, 0.5] with a probability of 0.5. (A sketch of the learning-rate warmup and masking-ratio sampling follows the table.) |
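The quoted Appendix A provides PyTorch-like pseudocode for iBOT without multi-crop augmentation. Below is a minimal sketch in that spirit, not the authors' released code: the `student`/`teacher` call signatures, the `H` cross-entropy helper, the temperatures, and the EMA momentum value are illustrative assumptions, while the shared-head [CLS] self-distillation plus masked-patch prediction against the online teacher follows the paper's description.

```python
import torch
import torch.nn.functional as F

def H(t_logits, s_logits, temp_t=0.04, temp_s=0.1):
    """Cross-entropy between the sharpened teacher distribution and the
    student distribution over the shared 8192-way head (assumed
    temperatures; teacher centering omitted for brevity)."""
    t = F.softmax(t_logits / temp_t, dim=-1).detach()
    s = F.log_softmax(s_logits / temp_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def ibot_step(u, v, mask_u, mask_v, student, teacher, ema_m=0.996):
    """One iBOT training step without multi-crop.

    u, v          -- two augmented views of the same images, (B, C, H, W)
    mask_u/mask_v -- boolean patch masks, (B, N), True = masked
    student/teacher are assumed to return ([CLS] logits, patch logits).
    """
    # The student sees masked views; the teacher (online tokenizer) sees clean ones.
    s_cls_u, s_patch_u = student(u, mask=mask_u)
    s_cls_v, s_patch_v = student(v, mask=mask_v)
    with torch.no_grad():
        t_cls_u, t_patch_u = teacher(u)
        t_cls_v, t_patch_v = teacher(v)

    # [CLS] self-distillation across views (DINO-style).
    loss_cls = 0.5 * (H(t_cls_v, s_cls_u) + H(t_cls_u, s_cls_v))

    # MIM: recover the teacher's patch tokens at masked positions, same view.
    loss_mim = 0.5 * (H(t_patch_u[mask_u], s_patch_u[mask_u])
                      + H(t_patch_v[mask_v], s_patch_v[mask_v]))

    (loss_cls + loss_mim).backward()  # optimizer step omitted

    # The teacher is an exponential moving average of the student,
    # which is what makes the tokenizer "online".
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(ema_m).add_(p_s, alpha=1.0 - ema_m)

    return loss_cls.item(), loss_mim.item()
```

The design point the sketch encodes is that the MIM target comes from the EMA teacher rather than a fixed offline tokenizer, so the targets evolve jointly with the student during pre-training.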
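The setup quote fixes two concrete numeric recipes: a batch-scaled base learning rate with a 10-epoch linear warmup, and a stochastic MIM prediction ratio. Below is a small self-contained sketch of both; it assumes nothing beyond the quote, and in particular omits whatever decay schedule follows the warmup.

```python
import random

def base_lr(batch_size, ref_lr=5e-4, ref_batch=256):
    """Linear scaling rule from the quote: lr = 5e-4 * batch_size / 256."""
    return ref_lr * batch_size / ref_batch

def warmup_lr(epoch, batch_size, warmup_epochs=10):
    """Linearly ramp the learning rate to its scaled base value
    over the first 10 epochs."""
    return base_lr(batch_size) * min(1.0, epoch / warmup_epochs)

def sample_prediction_ratio():
    """MIM prediction ratio r: 0 with probability 0.5, otherwise
    drawn uniformly from [0.1, 0.5]."""
    return 0.0 if random.random() < 0.5 else random.uniform(0.1, 0.5)

# With the paper's batch size of 1024, the base lr is 5e-4 * 4 = 2e-3.
print(warmup_lr(epoch=5, batch_size=1024))   # halfway through warmup: 1e-3
print(warmup_lr(epoch=10, batch_size=1024))  # fully warmed up: 2e-3
```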