Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
Authors: Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu, Baining Guo
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data. |
| Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 Microsoft Research Asia; 3 Microsoft Cloud + AI. {dlight@mail., zhangwm@, ynh@}.ustc.edu.cn EMAIL EMAIL |
| Pseudocode | No | No pseudocode or algorithm block found. |
| Open Source Code | No | No explicit statement or link providing access to the source code for the methodology described in this paper was found. |
| Open Datasets | Yes | We train the perceptual codebook using the training set of ImageNet-1K dataset by default. The model is pre-trained for 300/800 epochs with the batch size of 2048. We fine-tune the pre-trained model on various downstream tasks: image classification, object detection, and semantic segmentation. Experimental results show that our pre-trained model transfers better than BEiT with only the prediction target changed. Concretely, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B model, outperforming BEiT by +1.3% with the same 800 pre-training epochs. Our approach also gets significant improvement on COCO object detection and semantic segmentation as well as on ADE20K semantic segmentation. |
| Dataset Splits | No | No specific training/validation/test dataset splits were explicitly stated using percentages or sample counts. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers were provided (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | The model is pre-trained for 300/800 epochs with the batch size of 2048. We use a block-wise masking strategy for obtaining the corrupted images with the same setup as BEiT (Bao, Dong, and Wei 2021). We finetune the model with 100 epochs and a cosine decay learning rate that warmups to 4e-3 with 20 epochs and decays to 0. Following (Bao, Dong, and Wei 2021), the layerwise learning rate decay is also used and set to 0.65 by default. |
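
The fine-tuning schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the quoted text gives only the peak learning rate (4e-3), the 20 warmup epochs, the 100 total epochs, the decay to 0, and the layer-wise decay of 0.65. The linear warmup shape, the start from 0, the 12-layer depth, and the function names `finetune_lr` and `layerwise_lr_scales` are all assumptions made for this example.

```python
import math
from typing import List


def finetune_lr(epoch: float,
                peak_lr: float = 4e-3,
                warmup_epochs: int = 20,
                total_epochs: int = 100) -> float:
    """Learning rate at a given (possibly fractional) epoch.

    Assumption: warmup is linear from 0 to peak_lr; the source only
    states that the rate warms up to 4e-3 over 20 epochs.
    """
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    # Cosine decay from peak_lr at the end of warmup to 0 at total_epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def layerwise_lr_scales(num_layers: int = 12, decay: float = 0.65) -> List[float]:
    """Per-layer multipliers for layer-wise learning rate decay.

    Assumption: index 0 is the embedding, indices 1..num_layers are the
    transformer blocks, and the last index is the task head, which keeps
    the full learning rate. num_layers=12 is a hypothetical depth.
    """
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]


if __name__ == "__main__":
    for e in (0, 10, 20, 60, 100):
        print(f"epoch {e:3d}: lr = {finetune_lr(e):.6f}")
    scales = layerwise_lr_scales()
    print("head scale:", scales[-1], "last block scale:", scales[-2])
```

Under these assumptions, each parameter group's effective rate is the scheduled rate multiplied by its layer's scale, so earlier layers are updated more conservatively than the head.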