Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

Authors: Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments on popular benchmarks show that A2MIM learns better representations without explicit design and endows the backbone model with the stronger capability to transfer to various downstream tasks." |
| Researcher Affiliation | Academia | "1 AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, 310000, China; 2 College of Computer Science and Technology, Zhejiang University, Hangzhou, 310000, China; 3 Institute of AI Industry Research, Tsinghua University, Beijing, 100084, China. Correspondence to: Stan Z. Li <stan.z.li@westlake.edu.cn>." |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Experiment results and models are available at https://github.com/Westlake-AI/A2MIM." |
| Open Datasets | Yes | "Models are pre-trained on ImageNet-1K (IN-1K) training set with AdamW (Loshchilov & Hutter, 2019) optimizer, a batch size of 2048, and a basic learning rate of 1.2 x 10^-3 adjusted by a cosine learning rate scheduler. The input image size is 224 x 224 with a masked patch size of 32 x 32, and the random masking ratio is 60%. ... We benchmark CL and MIM methods on object detection and segmentation with COCO (Lin et al., 2014). ... We then evaluate the transferring performances on semantic segmentation with ADE20K (Zhou et al., 2019)." |
| Dataset Splits | Yes | "We evaluate the learned representation by end-to-end fine-tuning (FT) and linear probing (Lin.) protocols on IN-1K. For FT evaluations of ViTs, we employ the fine-tuning as MAE (He et al., 2022), which applies DeiT (Touvron et al., 2021) augmentations, AdamW optimizer with a batch size of 1024 for 200 epochs, and adopt a layer-wise learning rate decay of 0.65 as BEiT (Bao et al., 2022)." (layer-wise decay sketched after the table) |
| Hardware Specification | Yes | "Our experiments are implemented on OpenMixup (Li et al., 2022) by PyTorch and conducted on workstations with NVIDIA A100 GPUs." |
| Software Dependencies | No | The paper mentions PyTorch and OpenMixup but does not specify their version numbers, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | "Models are pre-trained on ImageNet-1K (IN-1K) training set with AdamW (Loshchilov & Hutter, 2019) optimizer, a batch size of 2048, and a basic learning rate of 1.2 x 10^-3 adjusted by a cosine learning rate scheduler. The input image size is 224 x 224 with a masked patch size of 32 x 32, and the random masking ratio is 60%. By default, the learnable mask tokens are placed at stage-3 and layer-0 in ResNet/ConvNeXt and ViT architectures, respectively. We adopt a linear prediction head as the MIM decoder (Xie et al., 2021b)." (pre-training recipe sketched after the table) |
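The "Experiment Setup" row quotes the pre-training recipe: random masking of 60% of the 32 x 32 patches, AdamW at a base learning rate of 1.2 x 10^-3 with a cosine schedule, and a linear prediction head as the MIM decoder. The following is a minimal PyTorch sketch of that recipe only; the helper names, the weight-decay value, and the schedule length are illustrative assumptions and are not taken from the paper or the authors' OpenMixup-based code.

```python
import torch
import torch.nn as nn

def random_patch_mask(batch_size, img_size=224, patch_size=32, mask_ratio=0.60):
    """Sample a binary mask over non-overlapping 32x32 patches (1 = masked)."""
    n = (img_size // patch_size) ** 2                     # 49 patches per 224x224 image
    n_mask = int(n * mask_ratio)                          # ~60% of patches masked
    ids = torch.rand(batch_size, n).argsort(dim=1)[:, :n_mask]
    mask = torch.zeros(batch_size, n)
    mask.scatter_(1, ids, 1.0)
    return mask                                           # (B, 49) patch-level mask

class LinearMIMHead(nn.Module):
    """Linear prediction head used as the MIM decoder: maps encoder features
    back to the RGB pixels of each patch."""
    def __init__(self, embed_dim=768, patch_size=32):
        super().__init__()
        self.proj = nn.Linear(embed_dim, patch_size * patch_size * 3)

    def forward(self, feats):                              # feats: (B, N, C)
        return self.proj(feats)                            # (B, N, 32*32*3)

# Optimizer and schedule as quoted: AdamW, base lr 1.2e-3, cosine decay.
encoder = nn.Identity()                                    # placeholder for ResNet/ConvNeXt/ViT
head = LinearMIMHead()
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=1.2e-3, weight_decay=0.05)           # weight decay is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # epoch count is an assumption

patch_mask = random_patch_mask(batch_size=4)               # toy usage of the masking helper
```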
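The "Dataset Splits" row mentions a layer-wise learning rate decay of 0.65 for ViT fine-tuning, following BEiT. Below is a minimal sketch of that scheme, assuming a timm-style ViT layout (`patch_embed`, `blocks`, `head`); the grouping logic and attribute names are assumptions for illustration, not the authors' exact code.

```python
def layerwise_lr_groups(vit, base_lr=1e-3, decay=0.65, num_layers=12):
    """Build optimizer parameter groups where earlier (lower) transformer
    blocks receive smaller learning rates: lr = base_lr * decay**depth_from_top."""
    groups = []
    # Patch embedding gets the smallest lr (pos_embed / cls_token omitted for brevity).
    groups.append({"params": list(vit.patch_embed.parameters()),
                   "lr": base_lr * decay ** num_layers})
    for i, block in enumerate(vit.blocks):                 # blocks 0 .. num_layers-1
        scale = decay ** (num_layers - i - 1)              # last block keeps the full base lr
        groups.append({"params": list(block.parameters()), "lr": base_lr * scale})
    # Classification head also keeps the full base lr.
    groups.append({"params": list(vit.head.parameters()), "lr": base_lr})
    return groups

# Usage (import timm and torch first; model name is an example):
#   vit = timm.create_model("vit_base_patch16_224", pretrained=False)
#   optimizer = torch.optim.AdamW(layerwise_lr_groups(vit), weight_decay=0.05)
```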