Budgeted Training for Vision Transformer
Authors: Zhuofan Xia, Xuran Pan, Xuan Jin, Yuan He, Hui Xue', Shiji Song, Gao Huang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our framework is applicable to various Vision Transformers, and achieves competitive performances on a wide range of training budgets. |
| Researcher Affiliation | Collaboration | 1Department of Automation, BNRist, Tsinghua University, Beijing, China 2Beijing Academy of Artificial Intelligence, Beijing, China 3Alibaba Group, Hangzhou, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology; it only refers to the third-party repositories and toolboxes it uses. |
| Open Datasets | Yes | We mainly evaluate our method on the popular ImageNet-1K (Deng et al., 2009) dataset for large-scale image recognition. ... transfer learning on smaller datasets. For CIFAR-10/100 (Krizhevsky et al., 2009) transfer learning task... For MS-COCO (Lin et al., 2014) object detection and instance segmentation task... For ADE20K (Zhou et al., 2017) semantic segmentation task... |
| Dataset Splits | No | The paper refers to evaluating on the 'ImageNet-1K validation set' but does not explicitly provide the training, validation, and test splits (e.g., percentages or sample counts) used for its experiments, nor does it cite a source for these splits beyond the dataset itself. |
| Hardware Specification | Yes | All the records of the time are measured on 8 RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using tools like MMDetection, MMSegmentation, and refers to the DeiT repository but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We follow most of the hyper-parameters of these models including model architectures, data augmentations, and stochastic depth rate (Huang et al., 2016). As discussed in Sec. 4.2, we adjust three factors controlling the training cost of the model, including the number of the activated attention heads M, the MLP hidden dimension C, and the proportion of patch tokens N. For all ViT models, we choose a moderate α = 2.0 in K = 3 training stages, which are carefully ablated and discussed in Sec. 5.3. More detailed specifications are summarized in Appendix A. |
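
The experiment-setup description above mentions three cost-controlling factors (activated attention heads M, MLP hidden dimension C, proportion of patch tokens N) grown over K = 3 stages with α = 2.0. The sketch below is only an illustration of what such a staged budget schedule could look like; the names `StageConfig` and `make_budget_schedule`, the geometric α-scaling rule, and the DeiT-S-like dimensions in the example are assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageConfig:
    """Per-stage reduced-cost configuration (illustrative field names, not from the paper)."""
    num_heads: int           # activated attention heads M
    mlp_hidden_dim: int      # MLP hidden dimension C
    token_keep_ratio: float  # proportion of patch tokens N kept


def make_budget_schedule(full_heads: int,
                         full_mlp_dim: int,
                         num_stages: int = 3,
                         alpha: float = 2.0) -> List[StageConfig]:
    """Sketch of a K-stage schedule that starts from a reduced model and grows the
    three cost factors by a factor of alpha per stage, reaching the full model in
    the last stage. The geometric growth rule is an assumption for illustration;
    the paper's Appendix A gives the exact per-stage specifications."""
    schedule = []
    for k in range(num_stages):
        # scale = alpha^-(K-1-k): smallest at stage 0, exactly 1.0 at the final stage
        scale = alpha ** (-(num_stages - 1 - k))
        schedule.append(StageConfig(
            num_heads=max(1, round(full_heads * scale)),
            mlp_hidden_dim=max(1, round(full_mlp_dim * scale)),
            token_keep_ratio=min(1.0, scale),
        ))
    return schedule


# Example: DeiT-S-like dimensions (6 heads, MLP hidden dim 1536), alpha = 2.0, K = 3
for stage, cfg in enumerate(make_budget_schedule(6, 1536)):
    print(f"stage {stage}: {cfg}")
```

Under these assumptions the schedule prints three stages whose head count, MLP width, and kept-token ratio double from one stage to the next until the full model configuration is trained in the final stage.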