Masked Image Residual Learning for Scaling Deeper Vision Transformers
Authors: Guoxi Huang, Hongtao Fu, Adrian G. Bors
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed MIRL method is evaluated on image classification, object detection, and semantic segmentation tasks. All models are pre-trained on ImageNet-1K and then fine-tuned on downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K |
| Researcher Affiliation | Collaboration | Guoxi Huang (Baidu Inc., huangguoxi@baidu.com); Hongtao Fu (Huazhong University of Science and Technology, m202173233@hust.edu.cn); Adrian G. Bors (University of York, adrian.bors@york.ac.uk) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and pretrained models are available at: https://github.com/russellllaputa/MIRL. |
| Open Datasets | Yes | We pre-train all models on the training set of ImageNet-1K with 32 GPUs. ... The experiment is conducted on MS COCO [30]... We compare our method with previous results on the ADE20K [61] dataset |
| Dataset Splits | Yes | All models are pre-trained on ImageNet-1K and then fine-tuned on downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K: We report the fine-tuning (ft) accuracy (%) for all models, which are pre-trained for 300 epochs. |
| Hardware Specification | No | We pre-train all models on the training set of ImageNet-1K with 32 GPUs. (The GPU count is given, but the GPU model and other hardware details are not specified.) |
| Software Dependencies | No | The paper mentions frameworks and libraries such as the Transformer architecture, MAE, Mask R-CNN, and mmdetection, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Pre-training setup. We pre-train all models on the training set of ImageNet-1K with 32 GPUs. By default, ViT-B-24 is divided into 4 segments, while ViT-S-54 and ViT-B-48 are split into 6 segments, and others into 2. Each appended decoder has 2 Transformer blocks with an injected DID module. We follow the setup in [21], masking 75% of visual tokens and applying basic data augmentation, including random horizontal flipping and random resized cropping. Full implementation details are in Appendix A. (A hedged code sketch of this setup follows the table.) |
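
To make the quoted setup concrete, below is a minimal PyTorch-style sketch of the two mechanical pieces the row describes: MAE-style random masking of 75% of visual tokens alongside the basic augmentations (random resized cropping and horizontal flipping), and an even split of encoder blocks into segments (e.g., the 24 blocks of ViT-B-24 into 4 segments). The 224-pixel crop size, normalization statistics, and helper names (`random_token_mask`, `split_into_segments`) are illustrative assumptions, not taken from the paper; the DID module and the appended 2-block decoders are not reproduced here.

```python
import torch
from torchvision import transforms

# Basic pre-training augmentation quoted in the setup row:
# random resized cropping and random horizontal flipping.
# The 224x224 crop size and ImageNet normalization stats are assumptions.
pretrain_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])


def random_token_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking: keep a random 25% of patch tokens.

    tokens: (batch, num_tokens, dim) tensor of patch embeddings.
    Returns the visible tokens and the indices that restore token order.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)  # one score per token
    shuffle = noise.argsort(dim=1)                  # random permutation
    restore = shuffle.argsort(dim=1)                # inverse permutation
    keep = shuffle[:, :n_keep]                      # indices of kept tokens
    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, restore


def split_into_segments(blocks, num_segments):
    """Evenly partition encoder blocks into segments, e.g. the 24 blocks
    of ViT-B-24 into 4 segments of 6 blocks each."""
    per_seg = len(blocks) // num_segments
    return [blocks[i * per_seg:(i + 1) * per_seg] for i in range(num_segments)]
```

For the deeper variants, the same helper would be called with `num_segments=6` to match the quoted defaults for ViT-S-54 and ViT-B-48, with each segment feeding one of the appended decoders.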