Good Helper Is around You: Attention-Driven Masked Image Modeling

Authors: Zhengqi Liu, Jie Gui, Hao Luo

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method with popular MIM methods (MAE, SimMIM) with linear probing and fine-tuning on the ImageNet-1K validation set. We validate the transferability of our method on other downstream tasks. We test the classification accuracy on CIFAR-10/100 (Krizhevsky 2009), Tiny ImageNet, STL10 (Coates, Ng, and Lee 2011), and ImageNet-1K (Deng et al. 2009) by linear probing and fine-tuning. For object detection and segmentation, we fine-tune on COCO (Lin et al. 2014) and LVIS (Gupta, Dollar, and Girshick 2019). Ablations on the ratio and the part of masking and throwing are also provided.
Researcher Affiliation | Collaboration | Zhengqi Liu (1), Jie Gui* (1, 2), Hao Luo (3); affiliations: 1 Southeast University, Nanjing, China; 2 Purple Mountain Laboratories, China; 3 Alibaba Group, China
Pseudocode | Yes | Algorithm 1: Algorithm of AMT for MIM (an illustrative sketch of the idea follows after this table).
Open Source Code | No | The paper does not provide an explicit statement about the release of its own source code, nor does it include a link to a code repository for its methodology. It mentions using "ViTDet's detectron2 codebase", but this is a third-party tool, not the authors' own implementation.
Open Datasets | Yes | We test the classification accuracy on CIFAR-10/100 (Krizhevsky 2009), Tiny ImageNet, STL10 (Coates, Ng, and Lee 2011), and ImageNet-1K (Deng et al. 2009) by linear probing and fine-tuning. For object detection and segmentation, we fine-tune on COCO (Lin et al. 2014) and LVIS (Gupta, Dollar, and Girshick 2019).
Dataset Splits | Yes | We evaluate our method with popular MIM methods (MAE, SimMIM) with linear probing and fine-tuning on the ImageNet-1K validation set. We fine-tune the models on the train2017 set for 90K iterations and evaluate on val2017. We fine-tune models on the train set for 75K iterations and evaluate on the val set.
Hardware Specification | No | The paper mentions "All experiments are conducted on a 4-GPU server," which is not specific enough to identify the hardware (e.g., GPU model, CPU, memory).
Software Dependencies | No | The paper mentions using a "detectron2 codebase" but does not specify version numbers for any software components (e.g., Python, PyTorch, CUDA, or specific libraries).
Experiment Setup | Yes | We pretrain MAE and SimMIM on ImageNet-1K for 400 epochs and 200 epochs respectively. We choose ViT-B/16 as the backbone of the encoder. Typically for MAE, we follow official codes to choose ViT-B/16 with 8 blocks as the decoder. For SimMIM, the decoder is a linear head. For our AMT, we choose the thrown ratio t = 0.4 and 0.26 for MAE and SimMIM respectively. To maintain the ratio between masked and visible tokens in the original works, the masking ratio is set as r = 0.45 and 0.44 for MAE and SimMIM respectively. The interval for updating the masking weights is 40 epochs (10% of the whole pre-training process of MAE, and 20% of SimMIM). (These values are collected in the configuration sketch directly after this table.)
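For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The dictionary below is a minimal summary; the key names and structure are assumptions made for illustration, while the values are those reported above.

```python
# Hypothetical configuration dictionaries collecting the hyperparameters
# quoted in the Experiment Setup row; key names are assumptions, values
# come from the paper as reported above.
AMT_PRETRAIN_CONFIGS = {
    "mae": {
        "backbone": "ViT-B/16",
        "decoder": "ViT decoder, 8 blocks",
        "pretrain_epochs": 400,
        "throw_ratio": 0.40,                 # t
        "mask_ratio": 0.45,                  # r
        "mask_weight_update_interval": 40,   # epochs, 10% of pre-training
    },
    "simmim": {
        "backbone": "ViT-B/16",
        "decoder": "linear head",
        "pretrain_epochs": 200,
        "throw_ratio": 0.26,
        "mask_ratio": 0.44,
        "mask_weight_update_interval": 40,   # epochs, 20% of pre-training
    },
}
```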
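The Pseudocode row above refers to Algorithm 1 (AMT), which is not reproduced in this report. The sketch below is a minimal PyTorch illustration of the masking-and-throwing idea, assuming per-patch attention scores are already available (e.g., averaged [CLS] attention from a recent checkpoint) and that the lowest-attention patches are thrown while a random subset of the rest is masked; the function name and the random masking rule are assumptions for illustration, not the paper's exact Algorithm 1.

```python
import torch

def amt_mask_and_throw(attn_scores, throw_ratio=0.4, mask_ratio=0.45):
    """Split patch indices into thrown / masked / visible sets.

    attn_scores: tensor of shape (B, N) with one attention score per patch,
    e.g. averaged [CLS] attention taken from a recent checkpoint (assumption).
    """
    B, N = attn_scores.shape
    n_throw = int(N * throw_ratio)   # t: fraction of patches discarded outright
    n_mask = int(N * mask_ratio)     # r: fraction of patches used as reconstruction targets

    # Rank patches from least to most attended.
    order = torch.argsort(attn_scores, dim=1)   # ascending scores
    thrown = order[:, :n_throw]                 # lowest-attention patches are thrown away
    remaining = order[:, n_throw:]              # candidates for masking / visibility

    # Illustrative rule: mask a random subset of the remaining patches
    # (the paper's Algorithm 1 defines the actual selection rule).
    perm = torch.rand(B, N - n_throw, device=attn_scores.device).argsort(dim=1)
    remaining = torch.gather(remaining, 1, perm)
    masked = remaining[:, :n_mask]              # reconstruction targets
    visible = remaining[:, n_mask:]             # tokens fed to the encoder
    return thrown, masked, visible
```

With the MAE setting quoted in the Experiment Setup row (t = 0.4, r = 0.45), 15% of the patches stay visible, which preserves MAE's original 3:1 masked-to-visible ratio, as the paper notes.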