PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Authors: Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu, Baining Guo

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.
Researcher Affiliation | Collaboration | ¹University of Science and Technology of China, ²Microsoft Research Asia, ³Microsoft Cloud + AI; {dlight@mail., zhangwm@, ynh@}ustc.edu.cn, cddlyf@gmail.com, {jianbao, ting.zhang, luyuan, doch, fangwen, bainguo}@microsoft.com
Pseudocode | No | No pseudocode or algorithm block found.
Open Source Code | No | No explicit statement or link providing access to the source code for the methodology described in this paper was found.
Open Datasets | Yes | We train the perceptual codebook using the training set of the ImageNet-1K dataset by default. The model is pre-trained for 300/800 epochs with a batch size of 2048. We fine-tune the pre-trained model on various downstream tasks: image classification, object detection, and semantic segmentation. Experimental results show that our pre-trained model transfers better than BEiT with only the prediction target changed. Concretely, we achieve 84.5% Top-1 accuracy on ImageNet-1K with the ViT-B model, outperforming BEiT by +1.3% with the same 800 pre-training epochs. Our approach also gets significant improvement on COCO object detection and semantic segmentation as well as on ADE20K semantic segmentation.
Dataset Splits | No | No specific training/validation/test dataset splits (as percentages or sample counts) were explicitly stated.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided.
Software Dependencies | No | No specific software dependencies with version numbers were provided (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | The model is pre-trained for 300/800 epochs with a batch size of 2048. We use a block-wise masking strategy for obtaining the corrupted images, with the same setup as BEiT (Bao, Dong, and Wei 2021). We fine-tune the model for 100 epochs with a cosine-decay learning rate that warms up to 4e-3 over 20 epochs and decays to 0. Following (Bao, Dong, and Wei 2021), layer-wise learning rate decay is also used and set to 0.65 by default.
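
The fine-tuning schedule quoted above is concrete enough to sketch. Below is a minimal illustration, not the authors' released code: a cosine learning-rate decay over 100 epochs with a linear warmup to the 4e-3 peak during the first 20 epochs, combined with the BEiT-style layer-wise learning-rate decay of 0.65, here assumed to span the 12 transformer blocks of a ViT-B backbone. All function and variable names are illustrative assumptions.

import math

PEAK_LR = 4e-3        # peak learning rate after warmup (from the paper)
WARMUP_EPOCHS = 20    # linear warmup length (from the paper)
TOTAL_EPOCHS = 100    # fine-tuning length (from the paper)
LAYER_DECAY = 0.65    # layer-wise LR decay factor (from the paper)
NUM_LAYERS = 12       # assumption: ViT-B has 12 transformer blocks

def lr_at_epoch(epoch: float) -> float:
    # Linear warmup to PEAK_LR, then cosine decay to 0.
    if epoch < WARMUP_EPOCHS:
        return PEAK_LR * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

def layer_lr_scale(layer_id: int) -> float:
    # BEiT-style layer-wise decay: layer_id 0 is the patch embedding,
    # NUM_LAYERS + 1 is the classification head (scale 1.0); layers
    # closer to the head receive larger learning rates.
    return LAYER_DECAY ** (NUM_LAYERS + 1 - layer_id)

if __name__ == "__main__":
    for epoch in (0, 10, 20, 60, 100):
        print(f"epoch {epoch:3d}: base lr = {lr_at_epoch(epoch):.2e}")
    print(f"patch embedding lr scale: {layer_lr_scale(0):.4f}")
    print(f"last block lr scale:      {layer_lr_scale(NUM_LAYERS):.4f}")

Under these assumptions, the per-parameter-group learning rate at any epoch is lr_at_epoch(epoch) * layer_lr_scale(layer_id), which reproduces the warmup-to-4e-3, decay-to-0 profile and the 0.65 depth-wise scaling the paper describes.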