Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling

Authors: Keyu Tian, Yi Jiang, Qishuai Diao, Chen Lin, Liwei Wang, Zehuan Yuan

ICLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). The improvements on object detection and instance segmentation are more significant (up to +3.5%), validating the strong transferability of features learned. We also find SparK's favorable scaling behavior by observing more gains on larger networks. All of these findings support the promising future of generative pre-training on convnets.
Researcher Affiliation Collaboration Keyu Tian1,2,3, Yi Jiang2, Qishuai Diao2, Chen Lin4, Liwei Wang1, Zehuan Yuan2 1Center for Data Science, Peking University 2Bytedance Inc 3Pazhou Lab (Huangpu) 4University of Oxford
Pseudocode No The paper provides actual Python code for the decoder architecture in Appendix A, but it does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code Yes Both code and pre-trained models have been released at https://github.com/keyu-tian/SparK.
Open Datasets Yes All models are pre-trained with 1.28 million unlabeled images from ImageNet-1K (Deng et al., 2009) training set for 1600 epochs. ... on COCO (Lin et al., 2014).
Dataset Splits Yes ImageNet validation set. ... Top-1 validation accuracy is reported. ... Average precisions of detection box (APbb) and segmentation mask (APmk) on val2017 are reported.
Hardware Specification Yes In practice, we found a sparse ResNet-50 can save ~23% memory footprint (26.4 GB vs. 34.5 GB for a single batch size of 128). This allows us to train it on a 32GB Tesla V100, which otherwise is impossible for non-sparse pre-training.
Software Dependencies No The paper mentions using PyTorch (implicitly, through code in Appendix A), Detectron2 (Wu et al., 2019), and MMDetection (Chen et al., 2019) but does not provide specific version numbers for these software components.
Experiment Setup Yes All models are pre-trained with 1.28 million unlabeled images from ImageNet-1K (...) training set for 1600 epochs. Only minimal augmentation is required (random cropping and horizontal flipping). We use the same mask patch size (32) and ratio (60%). We train with a LAMB optimizer (...), a batch size of 4096, and a cosine-annealing learning rate with peak value = 0.0002 × batchsize/256. Appendix C and D also provide detailed fine-tuning recipes including image resolution, epochs, optimizer, learning rate, weight decay, etc.
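As a hedged illustration of the recipe above (not the authors' released code), the linear learning-rate scaling rule and the patch-wise random masking could be sketched as follows; `random_patch_mask` is a hypothetical helper, and the 224-pixel image size is an assumption:

```python
import numpy as np

# Hyperparameters reported in the paper's pre-training setup.
BASE_LR = 0.0002   # peak learning rate per 256 samples
BATCH_SIZE = 4096  # pre-training batch size
MASK_PATCH = 32    # mask patch size in pixels
MASK_RATIO = 0.60  # fraction of patches masked

def peak_learning_rate(batch_size: int, base_lr: float = BASE_LR) -> float:
    """Linear scaling rule: peak LR = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

def random_patch_mask(image_size: int = 224, patch: int = MASK_PATCH,
                      ratio: float = MASK_RATIO, seed: int = 0) -> np.ndarray:
    """Boolean grid where True marks a masked patch (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    side = image_size // patch          # e.g. 224 // 32 = 7 -> a 7x7 grid
    n = side * side
    n_masked = round(n * ratio)         # 60% of 49 patches -> 29 masked
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, n_masked, replace=False)] = True
    return mask.reshape(side, side)

print(peak_learning_rate(BATCH_SIZE))  # 0.0002 * 4096 / 256 = 0.0032
```

With the reported batch size of 4096, the rule gives a peak learning rate of 0.0032, which matches the formula quoted in the setup row.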