PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
Authors: Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu, Baining Guo
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data. |
| Researcher Affiliation | Collaboration | 1 University of Science and Technology of China, 2 Microsoft Research Asia, 3 Microsoft Cloud + AI. {dlight@mail., zhangwm@, ynh@}ustc.edu.cn, cddlyf@gmail.com, {jianbao, ting.zhang, luyuan, doch, fangwen, bainguo}@microsoft.com |
| Pseudocode | No | No pseudocode or algorithm block found. |
| Open Source Code | No | No explicit statement or link providing access to the source code for the methodology described in this paper was found. |
| Open Datasets | Yes | We train the perceptual codebook using the training set of the ImageNet-1K dataset by default. The model is pre-trained for 300/800 epochs with a batch size of 2048. We fine-tune the pre-trained model on various downstream tasks: image classification, object detection, and semantic segmentation. Experimental results show that our pre-trained model transfers better than BEiT with only the prediction target changed. Concretely, we achieve 84.5% Top-1 accuracy on ImageNet-1K with the ViT-B model, outperforming BEiT by +1.3% with the same 800 pre-training epochs. Our approach also gets significant improvement on COCO object detection and segmentation as well as on ADE20K semantic segmentation. |
| Dataset Splits | No | No specific training/validation/test dataset splits were explicitly stated using percentages or sample counts. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers were provided (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | The model is pre-trained for 300/800 epochs with a batch size of 2048. We use a block-wise masking strategy for obtaining the corrupted images, with the same setup as BEiT (Bao, Dong, and Wei 2021). We fine-tune the model for 100 epochs with a cosine-decay learning rate that warms up to 4e-3 over 20 epochs and decays to 0. Following (Bao, Dong, and Wei 2021), layer-wise learning rate decay is also used and set to 0.65 by default. |
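The Experiment Setup row states that the corrupted inputs use BEiT-style block-wise masking. As a rough, hedged illustration of what that strategy does, the sketch below generates a block mask over a 14×14 patch grid; the minimum block size (16 patches), aspect-ratio bounds, and ~40% mask ratio are assumptions borrowed from BEiT's published recipe, not values confirmed by this report.

```python
import math
import random

import numpy as np


def blockwise_mask(grid_size=14, mask_ratio=0.4, min_block=16, max_aspect=3.0):
    """Sketch of BEiT-style block-wise masking over a patch grid.

    Returns a boolean (grid_size, grid_size) array; True marks a masked patch.
    Parameter defaults are assumptions taken from BEiT, not PeCo's code.
    """
    mask = np.zeros((grid_size, grid_size), dtype=bool)
    target = int(mask_ratio * grid_size * grid_size)  # ~78 of 196 patches
    while mask.sum() < target:
        # Sample a block area, then a log-uniform aspect ratio for its shape.
        area = random.uniform(min_block, target - mask.sum() + min_block)
        aspect = math.exp(random.uniform(math.log(1 / max_aspect), math.log(max_aspect)))
        h = int(round(math.sqrt(area * aspect)))
        w = int(round(math.sqrt(area / aspect)))
        if not (1 <= h <= grid_size and 1 <= w <= grid_size):
            continue  # resample if the block does not fit the grid
        top = random.randint(0, grid_size - h)
        left = random.randint(0, grid_size - w)
        mask[top:top + h, left:left + w] = True  # blocks may overlap
    return mask
```

In BEiT's pipeline the masked positions are replaced with a learnable mask embedding and the model predicts the corresponding visual tokens; per the report above, PeCo keeps this pipeline and changes only the prediction target to its perceptual codebook.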
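The fine-tuning hyperparameters in the same row (100 epochs, 20-epoch warmup to a peak learning rate of 4e-3, cosine decay to 0, layer-wise decay of 0.65) are concrete enough to sketch. The snippet below is an illustrative reconstruction, not the authors' released code; in particular, the assumption of 12 transformer blocks (ViT-B) and the patch-embedding scale at index 0 follow common timm-style conventions.

```python
import math


def layerwise_lr_scales(num_layers=12, decay=0.65):
    """Per-module learning-rate multipliers: the block nearest the head keeps
    the full rate, and each earlier block is scaled down by another 0.65.
    Index 0 covers the patch embedding; indices 1..num_layers the blocks."""
    return [decay ** (num_layers - i) for i in range(num_layers + 1)]


def lr_at_epoch(epoch, peak_lr=4e-3, warmup_epochs=20, total_epochs=100):
    """Linear warmup to peak_lr over 20 epochs, then cosine decay to 0."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

A real run would multiply each parameter group's learning rate by the matching scale from `layerwise_lr_scales` and step `lr_at_epoch` once per epoch (or per iteration with fractional epochs).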