Patch-Aware Sample Selection for Efficient Masked Image Modeling

Authors: Zhengyang Zhuge, Jiaxing Wang, Yong Li, Yongjun Bao, Peisong Wang, Jian Cheng

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show the effectiveness of PASS in selecting the most informative subset and accelerating pretraining. PASS exhibits superior performance across various datasets, MIM methods, and downstream tasks. In particular, PASS improves MAE by 0.7% on ImageNet-1K while utilizing only a 37% data budget and achieving a 1.7× speedup.
Researcher Affiliation | Collaboration | Zhengyang Zhuge 1,2; Jiaxing Wang 3; Yong Li 3; Yongjun Bao 3; Peisong Wang 1,2,4; Jian Cheng 1,2,4*. Affiliations: 1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 JD.com; 4 AiRiA. Contact: zhugezhengyang2020@ia.ac.cn, {wangjiaxing41, liyong5, baoyongjun}@jd.com, {peisong.wang, jcheng}@nlpr.ia.ac.cn
Pseudocode | Yes | Algorithm 1: Patch-Aware Sample Selection (a hedged sketch of the selection loop follows the table)
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology or a direct link to a code repository.
Open Datasets | Yes | In this paper, we apply our method to popular MIM methods (MAE, SimMIM), and evaluate with the linear probing and finetuning classification task on ImageNet-1K (Deng et al. 2009). Furthermore, we test the transferability of our method on other classification datasets such as CIFAR-10/100 (Krizhevsky 2009) and STL-10 (Coates, Ng, and Lee 2011). Additionally, to evaluate the generalization on semantic tasks, we conduct object detection and instance segmentation experiments on MS-COCO (Lin et al. 2014) and semantic segmentation on ADE20K (Zhou et al. 2019).
Dataset Splits | No | The paper uses datasets such as ImageNet-1K, CIFAR-10/100, STL-10, MS-COCO, and ADE20K, and mentions a '37% data budget' for pre-training with sample selection; it also evaluates on MS-COCO val2017. However, it does not explicitly provide the percentages or sample counts for training/validation/test splits, nor does it detail a stratified or group-based splitting methodology, which would be necessary for full reproducibility of the data partitioning.
Hardware Specification | Yes | All experiments are conducted on 8 RTX-3090 GPUs.
Software Dependencies | No | The paper mentions specific models and frameworks such as MAE, SimMIM, ViT backbones, Mask R-CNN, FPNs, and detectron2, but it does not provide version numbers for these components or for other software dependencies such as the programming language or deep learning libraries.
Experiment Setup | Yes | We pre-train MAE and SimMIM on ImageNet-1K for 200 epochs following (He et al. 2022; Xie et al. 2022a). Our method is applicable to various ViT backbones, although the experiments are mainly conducted with a ViT-B/16 encoder due to constrained computation resources. For pre-training, we patchify the 224×224 image into 14×14 patches. We adopt a decoder with 8 blocks for MAE, while for SimMIM a linear head is used as the decoder. For fine-tuning, the decoder is omitted, and a fully-connected layer with an n-way output (n = 1000 for ImageNet-1K) is appended to the output of the encoder as the classifier. For linear probing, we only train the last linear head while keeping the other layers frozen. For PASS, we perform sample selection every 20 epochs during pre-training. For the first 20 epochs, we use the full training dataset, while for the following epochs, a ρ|T| subset is selected using PASS for training according to the predefined data budget ρ. We adopt a mask predictor consisting of n_d (by default n_d = 8) blocks for the Dynamic Trained Mask Predictor (DTMP). For dynamic training, we set γ = 3. For the Weighted Selection Score (WSS), we set τ = 0.1. We set the mask ratio to 0.75 for MAE and 0.6 for SimMIM, and the mask ratio remains consistent between the selection stage and the pre-training stage. All pretrained models are finetuned on the MS-COCO train2017 set for a 1× schedule (12 epochs) with a resolution of 1024×1024 and batch size 16. We utilize UperNet as the segmentation model and conduct finetuning for 80k iterations with a resolution of 512×512.
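
The setup above pins down the PASS selection cadence: the full dataset for the first 20 epochs, then a ρ|T| subset re-selected every 20 epochs under the data budget (ρ = 0.37 in the headline result). A minimal sketch of that schedule follows; `score_fn` and `train_fn` are hypothetical stand-ins for the paper's WSS scoring and MIM training step, not the authors' implementation.

```python
import numpy as np

def pass_schedule(num_samples, score_fn, train_fn,
                  total_epochs=200, warmup_epochs=20,
                  selection_interval=20, rho=0.37):
    """Sketch of the PASS training schedule described in the setup.

    score_fn() -> np.ndarray of per-sample selection scores (the paper's
    WSS, computed with the dynamically trained mask predictor); train_fn
    runs one MIM epoch on the given sample indices. Both are
    caller-supplied stand-ins, not the paper's actual implementation.
    """
    active = np.arange(num_samples)                    # start from the full set T
    for epoch in range(total_epochs):
        if epoch >= warmup_epochs and epoch % selection_interval == 0:
            budget = int(rho * num_samples)            # rho * |T| samples
            active = np.argsort(score_fn())[-budget:]  # keep the top-scoring subset
        train_fn(epoch, active)
    return active

# Toy usage: random scores, no-op training step.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    final = pass_schedule(1000, lambda: rng.random(1000), lambda e, idx: None)
    print(len(final))  # 370 samples kept under a 37% budget
```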
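
The setup quotes only the temperature τ = 0.1 for the Weighted Selection Score, not its formula. Purely as an illustration of one plausible reading (softmax-weighting per-patch reconstruction losses so that hard patches dominate a sample's score), with every name below hypothetical:

```python
import numpy as np

def weighted_selection_score(patch_losses, tau=0.1):
    """Hypothetical WSS: weight each patch's reconstruction loss by a
    softmax over the losses with temperature tau, so harder patches
    dominate the per-sample score. This is a guess at the formula; the
    paper defines WSS precisely, and this report only quotes tau = 0.1."""
    z = patch_losses / tau
    w = np.exp(z - z.max())   # numerically stable softmax weights
    w /= w.sum()
    return float((w * patch_losses).sum())

print(weighted_selection_score(np.array([0.2, 0.9, 0.4])))
```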
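
The patch geometry and masking in the setup (224×224 inputs split into a 14×14 grid of 16×16 patches for ViT-B/16, with mask ratio 0.75 for MAE and 0.6 for SimMIM) can be checked with a short NumPy sketch of MAE-style random masking; this illustrates the stated numbers rather than reproducing the authors' code.

```python
import numpy as np

def patchify_and_mask(img, patch=16, mask_ratio=0.75, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and randomly
    mask a fraction of them, as in MAE-style pretraining (0.75 for MAE,
    0.6 for SimMIM per the setup above). Returns the visible patches and
    the boolean mask over the flattened patch grid."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, c = img.shape                       # e.g. 224, 224, 3
    gh, gw = h // patch, w // patch           # 14 x 14 grid for 224 / 16
    patches = (img.reshape(gh, patch, gw, patch, c)
                  .swapaxes(1, 2)
                  .reshape(gh * gw, patch * patch * c))
    num_masked = int(mask_ratio * len(patches))   # 147 of 196 at ratio 0.75
    order = rng.permutation(len(patches))
    mask = np.zeros(len(patches), dtype=bool)
    mask[order[:num_masked]] = True               # True = masked (hidden)
    return patches[~mask], mask

visible, mask = patchify_and_mask(np.zeros((224, 224, 3), dtype=np.float32))
print(visible.shape, mask.sum())  # -> (49, 768) 147
```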