Masked Image Residual Learning for Scaling Deeper Vision Transformers

Authors: Guoxi Huang, Hongtao Fu, Adrian G. Bors

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed MIRL method is evaluated on image classification, object detection and semantic segmentation tasks. All models are pre-trained on ImageNet-1K and then fine-tuned in downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K
Researcher Affiliation | Collaboration | Guoxi Huang, Baidu Inc. (huangguoxi@baidu.com); Hongtao Fu, Huazhong University of Science and Technology (m202173233@hust.edu.cn); Adrian G. Bors, University of York (adrian.bors@york.ac.uk)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and pretrained models are available at: https://github.com/russellllaputa/MIRL.
Open Datasets | Yes | We pre-train all models on the training set of ImageNet-1K with 32 GPUs. ... The experiment is conducted on MS COCO [30]... We compare our method with previous results on the ADE20K [61] dataset
Dataset Splits | Yes | All models are pre-trained on ImageNet-1K and then fine-tuned in downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K: We report the fine-tuning (ft) accuracy (%) for all models, which are pre-trained for 300 epochs.
Hardware Specification | No | We pre-train all models on the training set of ImageNet-1K with 32 GPUs.
Software Dependencies | No | The paper mentions frameworks and libraries such as the Transformer architecture, MAE, Mask R-CNN, and mmdetection, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Pre-training setup. We pre-train all models on the training set of ImageNet-1K with 32 GPUs. By default, ViT-B-24 is divided into 4 segments, while ViT-S-54 and ViT-B-48 are split into 6 segments, and others into 2. Each appended decoder has 2 Transformer blocks with an injected DID module. We follow the setup in [21], masking 75% of visual tokens and applying basic data augmentation, including random horizontal flipping and random resized cropping. Full implementation details are in Appendix A.
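
To make the quoted setup concrete, here is a minimal PyTorch sketch of the two components it names: the basic augmentation (random resized cropping and random horizontal flipping) and MAE-style random masking of 75% of the visual tokens. This is an illustrative assumption, not the authors' released implementation; the crop size, normalization constants, and the `random_masking` helper are hypothetical choices.

```python
# Illustrative sketch (assumed, not taken from the MIRL repository) of the quoted
# pre-training data pipeline: basic augmentation + 75% random token masking.
import torch
from torchvision import transforms

# Basic data augmentation named in the setup: random resized cropping and
# random horizontal flipping (crop size and normalization are assumptions).
pretrain_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style masking: keep a random (1 - mask_ratio) subset of patch tokens.

    tokens: (batch, num_patches, dim) patch embeddings, [CLS] token excluded.
    Returns the visible tokens, a binary mask (1 = masked), and the indices
    needed to restore the original token order for the decoder.
    """
    b, n, d = tokens.shape
    len_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n, device=tokens.device)   # one random score per token
    ids_shuffle = torch.argsort(noise, dim=1)        # ascending: lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n, device=tokens.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original token order
    return visible, mask, ids_restore
```

With a 75% mask ratio and, assuming a 224x224 input with 16x16 patches, only 49 of 196 tokens are passed to the encoder, which is what keeps this style of pre-training affordable for the deeper ViT-B-24, ViT-B-48 and ViT-S-54 encoders referenced above.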