Masked Image Modeling with Denoising Contrast

Authors: Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform self-supervised learning with ConMIM on ImageNet (Deng et al., 2009), and then fine-tune the pre-trained vision Transformers with various scales on image classification, semantic segmentation, object detection and instance segmentation. We evaluate the models pre-trained by our ConMIM on different downstream tasks, including image classification on ImageNet-1K (Deng et al., 2009) (Sec. 5.1), semantic segmentation on ADE20K (Zhou et al., 2017) (Sec. 5.2), object detection and instance segmentation on COCO (Lin et al., 2014) (Sec. 5.3). We further discuss the key components in ConMIM pre-training via ablation studies in Sec. 5.4.
Researcher Affiliation | Collaboration | 1 ARC Lab, 2 Tencent PCG, 3 Foundation Technology Center, 4 Peking University, 5 Huazhong University of Science and Technology, 6 Tsinghua University
Pseudocode | Yes | Algorithm 1: Pseudocode of ConMIM pre-training in a PyTorch-like style. (A hedged sketch in the same spirit follows the table.)
Open Source Code | No | Code will be available at https://github.com/TencentARC/ConMIM.
Open Datasets | Yes | ConMIM pre-training is conducted on the training set of the ImageNet-1K (Deng et al., 2009) dataset in a self-supervised manner. We evaluate the models pre-trained by our ConMIM on different downstream tasks, including image classification on ImageNet-1K (Deng et al., 2009) (Sec. 5.1), semantic segmentation on ADE20K (Zhou et al., 2017) (Sec. 5.2), object detection and instance segmentation on COCO (Lin et al., 2014) (Sec. 5.3).
Dataset Splits | Yes | ConMIM pre-training is conducted on the training set of the ImageNet-1K (Deng et al., 2009) dataset in a self-supervised manner. We test our ConMIM by fine-tuning the pre-trained models on ImageNet-1K (Deng et al., 2009) classification, which contains 1.3M images out of 1K classes in total. We mostly follow the fine-tuning setup of BEiT (Bao et al., 2022).
Hardware Specification | Yes | We adopt 16 A100 GPUs for pre-training (32 A100 GPUs for ViT-L/16).
Software Dependencies | No | The paper mentions a 'PyTorch-like style' for the pseudocode, implying the use of PyTorch. However, it does not provide specific version numbers for PyTorch or any other software dependencies needed to reproduce the experiment.
Experiment Setup | Yes | Pre-training setup. ConMIM pre-training is conducted on the training set of the ImageNet-1K (Deng et al., 2009) dataset in a self-supervised manner. We utilize ViT-S/16, ViT-B/16 and ViT-L/16 (Dosovitskiy et al., 2021) as the backbone networks... The input images are all resized to 224×224 and the patch size is set to 16×16. We follow the masking strategy of MAE (He et al., 2022), i.e., 75% of patches are randomly masked. The learning rate is set to 5e-4, with a warmup of 10 epochs, and cosine learning rate decay. The temperature τ is set to 0.1 and the momentum coefficient α is initially set to 0.996 with a cosine scheduler. ViT-B/16 and ViT-L/16 are pre-trained for 800 epochs in total and ViT-S/16 is pre-trained for 300 epochs if not specified. More implementation details can be found in Appendix A. Appendix A.1 provides Table 11: Hyper-parameters for ConMIM pre-training on ImageNet-1K, listing detailed settings such as Layers, Hidden size, Batch size, Adam parameters, etc. (A sketch of the momentum schedule follows the table.)
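
The Pseudocode row cites the paper's Algorithm 1, written in a PyTorch-like style. As a reading aid, here is a minimal sketch of the kind of patch-level denoising contrastive loss the paper describes (temperature τ = 0.1, 75% random masking); the names `conmim_loss`, `student`, and `teacher`, and the exact masking and augmentation handling, are assumptions for illustration rather than the authors' released implementation.

```python
# Hypothetical sketch of a ConMIM-style denoising contrastive objective (not the authors' code).
import torch
import torch.nn.functional as F

def conmim_loss(student, teacher, images, mask, tau=0.1):
    # student(images, mask): ViT encoder on the corrupted (75%-masked) view -> [B, N, D] patch features
    # teacher(images):       momentum-updated ViT encoder on the clean view -> [B, N, D] patch features
    # mask: [B, N] boolean, True where a patch was masked out
    with torch.no_grad():
        targets = F.normalize(teacher(images), dim=-1)      # clean patch features, no gradient

    preds = F.normalize(student(images, mask), dim=-1)      # denoised predictions from the masked view

    # Intra-image patch-level contrast: each masked-patch prediction should match the clean
    # feature at the same position; the other patches of the image act as negatives.
    logits = torch.einsum("bnd,bmd->bnm", preds, targets) / tau   # [B, N, N]
    labels = torch.arange(logits.size(1), device=logits.device)
    labels = labels.unsqueeze(0).expand(logits.size(0), -1)       # [B, N], positive = same patch index

    return F.cross_entropy(logits[mask], labels[mask])            # loss only on masked positions
```

The positives here are same-position patch features from a momentum teacher on the uncorrupted image, which is the "denoising contrast" reading of the paper's title; the teacher update itself is sketched next.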
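
The Experiment Setup row reports that the momentum coefficient α starts at 0.996 and follows a cosine scheduler. A common way to realize such an EMA teacher update and schedule is sketched below; the helper names (`momentum_schedule`, `ema_update`) are hypothetical, and the 1.0 endpoint of the schedule is an assumption since the paper only states the 0.996 start and a cosine scheduler.

```python
import math
import torch

def momentum_schedule(step, total_steps, base_m=0.996, final_m=1.0):
    # Cosine increase of the EMA momentum from base_m (0.996 in the paper) toward final_m.
    # final_m = 1.0 is an assumption, not stated in the paper.
    return final_m - (final_m - base_m) * (math.cos(math.pi * step / total_steps) + 1) / 2

@torch.no_grad()
def ema_update(student, teacher, m):
    # Exponential moving average: teacher <- m * teacher + (1 - m) * student.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)
```

In a training loop, the student would be updated by the optimizer (learning rate 5e-4, 10-epoch warmup, cosine decay per the setup row), after which `ema_update(student, teacher, momentum_schedule(step, total_steps))` refreshes the teacher.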