iBOT: Image BERT Pre-Training with Online Tokenizer

Authors: Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which helps the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
Researcher Affiliation | Collaboration | Jinghao Zhou1, Chen Wei2, Huiyu Wang2, Wei Shen3, Cihang Xie4, Alan Yuille2, Tao Kong1 (1ByteDance, 2Johns Hopkins University, 3Shanghai Jiao Tong University, 4UC Santa Cruz)
Pseudocode | Yes | Appendix A (Pseudocode), Algorithm 1: iBOT PyTorch-like pseudocode w/o multi-crop augmentation. (A hedged sketch of this step follows the table.)
Open Source Code | Yes | The code and models are publicly available at https://github.com/bytedance/ibot.
Open Datasets | Yes | We pre-train iBOT on ImageNet-1K (Deng et al., 2009) training set with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 1024. We also pre-train on ImageNet-22K training set with ViT-B/16 for 80 epochs and ViT-L/16 for 50 epochs.
Dataset Splits | Yes | For k-NN evaluation, we sweep over different numbers of nearest neighbors. For linear evaluation, we sweep over different learning rates. (A toy sweep sketch follows the table.)
Hardware Specification | Yes | All methods are trained on two 8-GPU V100 machines with a batch size of 1024.
Software Dependencies | No | The paper mentions the 'AdamW (Loshchilov & Hutter, 2019)' optimizer but does not specify software versions for any libraries, frameworks, or languages used.
Experiment Setup | Yes | We use the Vision Transformers (Dosovitskiy et al., 2021) and Swin Transformers (Liu et al., 2021b) with different amounts of parameters, ViT-S/16, ViT-B/16, ViT-L/16, and Swin-T/{7,14}, as the backbone f. ... We set the output dimension of the shared head to 8192. ... We by default pre-train iBOT on the ImageNet-1K (Deng et al., 2009) training set with the AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 1024. We pre-train iBOT with ViT-S/16 for 800 epochs, ViT-B/16 for 400 epochs, ViT-L/16 for 250 epochs, and Swin-T/{7,14} for 300 epochs. ... The learning rate is linearly ramped up during the first 10 epochs to its base value scaled with the total batch size: lr = 5e-4 × batch_size / 256. We use random MIM, with prediction ratio r set as 0 with a probability of 0.5 and uniformly sampled from the range [0.1, 0.5] with a probability of 0.5. (A sketch of the scaling rule and ratio sampling follows the table.)
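The Pseudocode row refers to the paper's Algorithm 1. Below is a minimal PyTorch-style sketch of that training step as reconstructed from the paper's description (cross-view self-distillation on the [CLS] token plus a MIM loss on masked patch tokens, with an EMA teacher acting as the online tokenizer). The function names, temperatures, and momentum value here are illustrative assumptions, not the authors' exact Algorithm 1.

```python
# Minimal sketch of one iBOT training step, reconstructed from the paper's
# description. `student`/`teacher` are assumed to return K-dim logits for the
# [CLS] token and for every patch token; masks are boolean (B, N) tensors.
import torch
import torch.nn.functional as F

K = 8192                 # output dimension of the shared head (from the paper)
tps, tpt = 0.1, 0.04     # student / teacher temperatures (assumed DINO-style)
m = 0.996                # EMA momentum for the teacher (assumed)
center = torch.zeros(1, K)  # running center for teacher logits (update omitted)

def H(s, t):
    # Cross-entropy between student logits s and centered, sharpened
    # teacher logits t (teacher side is detached: no gradient flows back).
    t = F.softmax((t - center) / tpt, dim=-1).detach()
    return -(t * F.log_softmax(s / tps, dim=-1)).sum(dim=-1).mean()

def ibot_step(student, teacher, u, v, mask_u, mask_v):
    # Student sees the *masked* views; the teacher sees the full views.
    s_cls_u, s_patch_u = student(u, mask=mask_u)
    s_cls_v, s_patch_v = student(v, mask=mask_v)
    with torch.no_grad():
        t_cls_u, t_patch_u = teacher(u)
        t_cls_v, t_patch_v = teacher(v)

    # L_[CLS]: cross-view self-distillation on the [CLS] token.
    loss_cls = (H(s_cls_u, t_cls_v) + H(s_cls_v, t_cls_u)) / 2

    # L_MIM: same-view distillation on the masked patch tokens only;
    # the teacher plays the role of the online tokenizer.
    loss_mim = (H(s_patch_u[mask_u], t_patch_u[mask_u])
                + H(s_patch_v[mask_v], t_patch_v[mask_v])) / 2
    return loss_cls + loss_mim

@torch.no_grad()
def ema_update(student, teacher):
    # After each optimizer step, the teacher tracks the student by EMA.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```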
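The Dataset Splits row mentions hyperparameter sweeps for the k-NN and linear evaluations. As a toy illustration only (the neighbor counts are guesses, and scikit-learn's plain k-NN stands in for the paper's actual protocol over frozen features), such a sweep could look like:

```python
# Hypothetical k-NN sweep over frozen backbone features; the ks values are
# assumed, not taken from the paper.
from sklearn.neighbors import KNeighborsClassifier

def knn_sweep(train_feats, train_labels, val_feats, val_labels,
              ks=(10, 20, 100, 200)):
    scores = {}
    for k in ks:
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(train_feats, train_labels)
        scores[k] = clf.score(val_feats, val_labels)
    best_k = max(scores, key=scores.get)  # keep the best-performing k
    return best_k, scores
```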
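The Experiment Setup row compresses two scheduling details that are easy to misread. Here is a small sketch assuming only what the quote states: a 10-epoch linear warmup to lr = 5e-4 × batch_size / 256, and the two-branch sampling of the MIM prediction ratio r. The function names are illustrative, and the schedule after warmup is not specified in the quoted text.

```python
import random

def scaled_base_lr(batch_size, base=5e-4):
    # Linear scaling rule from the setup: lr = 5e-4 * batch_size / 256.
    # With the paper's batch size of 1024 this gives 2e-3.
    return base * batch_size / 256

def warmup_lr(epoch, batch_size, warmup_epochs=10):
    # Linear ramp to the scaled base value over the first 10 epochs;
    # what happens after warmup is not covered by the quoted text.
    return scaled_base_lr(batch_size) * min(1.0, (epoch + 1) / warmup_epochs)

def sample_prediction_ratio():
    # Random MIM: r = 0 with probability 0.5, otherwise r ~ Uniform[0.1, 0.5].
    return 0.0 if random.random() < 0.5 else random.uniform(0.1, 0.5)
```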