Good Helper Is around You: Attention-Driven Masked Image Modeling
Authors: Zhengqi Liu, Jie Gui, Hao Luo
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method with popular MIM methods (MAE, SimMIM) with linear probing and fine-tuning on ImageNet-1K validation set. We validate the transferability of our method on other downstream tasks. We test the classification accuracy on CIFAR-10/100 (Krizhevsky 2009), Tiny ImageNet, STL10 (Coates, Ng, and Lee 2011), and ImageNet-1K (Deng et al. 2009) by linear probing and fine-tuning. For object detection and segmentation, we finetune on COCO (Lin et al. 2014) and LVIS (Gupta, Dollar, and Girshick 2019). Ablations about the ratio and the part of masking and throwing are also provided. |
| Researcher Affiliation | Collaboration | Zhengqi Liu1, Jie Gui*1,2, Hao Luo3 1 Southeast University, Nanjing, China 2 Purple Mountain Laboratories, China 3 Alibaba group, China |
| Pseudocode | Yes | Algorithm 1: Algorithm of AMT for MIM |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its own source code, nor does it include a link to a code repository for its methodology. It mentions using 'ViTDet's detectron2 codebase', but this is a third-party tool, not the authors' own implementation. |
| Open Datasets | Yes | We test the classification accuracy on CIFAR-10/100 (Krizhevsky 2009), Tiny ImageNet, STL10 (Coates, Ng, and Lee 2011), and ImageNet-1K (Deng et al. 2009) by linear probing and fine-tuning. For object detection and segmentation, we finetune on COCO (Lin et al. 2014) and LVIS (Gupta, Dollar, and Girshick 2019). |
| Dataset Splits | Yes | We evaluate our method with popular MIM methods (MAE, SimMIM) with linear probing and fine-tuning on ImageNet-1K validation set. We fine-tune the models on train2017 set for 90K iterations and evaluate on val2017. We fine-tune models on train set for 75K iterations and evaluate on val set. |
| Hardware Specification | No | The paper mentions "All experiments are conducted on a 4-GPU server," which is not specific enough to identify the hardware (e.g., GPU model, CPU, memory). |
| Software Dependencies | No | The paper mentions using a "detectron2 codebase" but does not specify version numbers for any software components (e.g., Python, PyTorch, CUDA, or specific libraries). |
| Experiment Setup | Yes | We pretrain MAE and SimMIM on ImageNet-1K for 400 epochs and 200 epochs respectively. We choose the ViT-B/16 as the backbone of the encoder. Typically for MAE, we follow official codes to choose ViT-B/16 with 8 blocks as decoder. For SimMIM, the decoder is a linear head. For our AMT, we choose the thrown ratio t = 0.4 and 0.26 for MAE and SimMIM respectively. To maintain the ratio between masked and visible tokens in original works, the masking ratio is respectively set as r = 0.45 and 0.44 for MAE and SimMIM. The interval for updating masking weights aw is 40 epochs (10% of the whole pre-training process of MAE, and 20% of SimMIM). |
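
Since the authors' code is not released (see the Open Source Code row) and only Algorithm 1 is given as pseudocode, the sketch below illustrates how the hyperparameters quoted in the Experiment Setup row (thrown ratio t, masking ratio r, attention-derived masking weights) could drive a token split. It is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the function name `amt_split_tokens`, the use of averaged [CLS]-to-patch attention as the score, and the random sampling of masked tokens among the retained patches are all assumptions made for illustration.

```python
import torch


def amt_split_tokens(attn_weights, throw_ratio=0.4, mask_ratio=0.45, generator=None):
    """Illustrative attention-driven masking/throwing split (assumed, not the paper's code).

    attn_weights: (B, N) per-patch attention scores, e.g. averaged [CLS]->patch
                  attention from the encoder's last block.
    throw_ratio:  fraction of all patches to discard entirely (lowest attention).
    mask_ratio:   fraction of all patches to mask for reconstruction.
    Returns index tensors for thrown, masked, and visible tokens.
    """
    B, N = attn_weights.shape
    n_throw = int(N * throw_ratio)
    n_mask = int(N * mask_ratio)

    # Sort patches by attention score, ascending: least-attended patches first.
    order = torch.argsort(attn_weights, dim=1)

    # Throw away the least-attended patches entirely.
    thrown_idx = order[:, :n_throw]

    # Among the remaining patches, choose masked tokens at random
    # (a simple stand-in for whatever sampling rule Algorithm 1 uses).
    remaining = order[:, n_throw:]
    perm = torch.argsort(torch.rand(B, N - n_throw, generator=generator), dim=1)
    remaining = torch.gather(remaining, 1, perm)
    masked_idx = remaining[:, :n_mask]
    visible_idx = remaining[:, n_mask:]

    return thrown_idx, masked_idx, visible_idx


if __name__ == "__main__":
    # Toy usage: 196 patches (14x14 grid for ViT-B/16 at 224x224), MAE-style ratios.
    attn = torch.rand(2, 196)
    thrown, masked, visible = amt_split_tokens(attn, throw_ratio=0.4, mask_ratio=0.45)
    print(thrown.shape, masked.shape, visible.shape)
```

With 196 patches, t = 0.4 and r = 0.45 leave 118 retained patches of which 88 are masked, i.e. roughly 75% of the non-thrown tokens, which is consistent with the paper's stated goal of maintaining the masked-to-visible ratio of the original MAE setup.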