Masked Frequency Modeling for Self-Supervised Visual Pre-Training

Authors: Jiahao Xie, Wei Li, Xiaohang Zhan, Ziwei Liu, Yew-Soon Ong, Chen Change Loy

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on image classification and semantic segmentation, as well as several robustness benchmarks, show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches.
Researcher Affiliation | Academia | Jiahao Xie1,2, Wei Li1,2, Xiaohang Zhan3, Ziwei Liu1,2, Yew-Soon Ong2,4, Chen Change Loy1,2. 1S-Lab, NTU; 2SCSE, NTU; 3CUHK; 4A*STAR, Singapore. {jiahao003, wei.l, ziwei.liu, asysong, ccloy}@ntu.edu.sg, xiaohangzhan@outlook.com
Pseudocode | Yes | Algorithm 1: Pseudocode of MFM in a PyTorch-like style.
Open Source Code | Yes | Project page: https://www.mmlab-ntu.com/project/mfm/index.html. Code and models will be released at https://www.mmlab-ntu.com/project/mfm/index.html to facilitate future research.
Open Datasets | Yes | We perform self-supervised pre-training on the ImageNet-1K (Deng et al., 2009) training set without labels. For ViT, our pre-training setting generally follows BEiT (Bao et al., 2022), while we only use random resized cropping (224×224 resolution) and flipping as data augmentation, with dropout and stochastic depth not applied. We also do not use relative position or layer scaling. After pre-training, we conduct supervised end-to-end fine-tuning on ImageNet-1K image classification and ADE20K (Zhou et al., 2017) semantic segmentation to evaluate the quality of learned representations, following BEiT (Bao et al., 2022).
Dataset Splits | Yes | All experiments are conducted with 300-epoch pre-training and 100-epoch fine-tuning on the ImageNet-1K dataset unless otherwise specified. We evaluate the robustness of our models on a series of benchmarks in three aspects: (i) adversarial robustness, (ii) common corruption robustness, and (iii) out-of-distribution robustness. For (i), we study the adversarial examples generated by white-box attackers (e.g., FGSM (Goodfellow et al., 2014) and PGD (Madry et al., 2017)) on the ImageNet-1K validation set as well as natural adversarial examples on ImageNet-A (Hendrycks et al., 2021b); for (ii), we evaluate on ImageNet-C (Hendrycks & Dietterich, 2019), which includes 15 types of algorithmically generated corruptions with five levels of severity; for (iii), we test on ImageNet-R (Hendrycks et al., 2021a) and ImageNet-Sketch (Wang et al., 2019), which contain images with naturally occurring distribution shifts. We evaluate the same models fine-tuned on original ImageNet-1K (ViT-B/16 in Table 3 and ResNet-50 in Table 4b) without any specialized fine-tuning on the different validation sets.
Hardware Specification | Yes | All experiments are conducted on 16 V100 32G GPUs for ViT models and 8 V100 32G GPUs for ResNet-50.
Software Dependencies | No | The paper mentions "PyTorch-like style" in its pseudocode and references optimizers like "AdamW (Loshchilov & Hutter, 2017)" but does not provide specific version numbers for software dependencies such as PyTorch, CUDA, or other libraries.
Experiment Setup | Yes | Table 11: Pre-training settings for vanilla ViT-S/16, ViT-B/16 and ResNet-50 models on ImageNet-1K. Table 12: Fine-tuning settings for vanilla ViT-S/16 and ViT-B/16 on ImageNet-1K. Table 13: Fine-tuning settings for vanilla ResNet-50 on ImageNet-1K. The hyper-parameters generally follow Wightman et al. (2021), except that we adopt the AdamW optimizer following Fang et al. (2022).
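The core operation behind MFM is masking an image in the frequency domain rather than the spatial domain. The sketch below is not the paper's Algorithm 1; it is a minimal numpy illustration of the frequency-masking step, where the circular low/high-pass split, the radius, and the function names are illustrative assumptions:

```python
import numpy as np

def frequency_mask(img, radius, keep="low"):
    """Illustrative frequency masking for a single-channel image.

    Applies a 2D FFT, zeroes out frequencies inside (keep="high") or
    outside (keep="low") a circle of `radius` around the shifted
    spectrum center, then inverts back to the spatial domain.
    The radius and circular mask shape are assumptions for this sketch.
    """
    H, W = img.shape
    freq = np.fft.fftshift(np.fft.fft2(img))          # center the spectrum
    yy, xx = np.mgrid[:H, :W]
    dist = np.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    mask = dist <= radius if keep == "low" else dist > radius
    return np.real(np.fft.ifft2(np.fft.ifftshift(freq * mask)))

img = np.random.rand(32, 32)
low = frequency_mask(img, radius=8, keep="low")    # input to the model
high = frequency_mask(img, radius=8, keep="high")  # masked-out content
# The two masks partition the spectrum, so by linearity of the FFT
# the low- and high-pass parts sum back to the original image.
assert np.allclose(low + high, img)
```

In the actual pre-training objective, the model would receive the masked (e.g., low-pass) image and regress the missing frequency content; the assertion above only checks that the split loses no information.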