Transformer in Transformer
Authors: Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, Yunhe Wang
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. |
| Researcher Affiliation | Collaboration | Kai Han (1,2), An Xiao (2), Enhua Wu (1,3), Jianyuan Guo (2), Chunjing Xu (2), Yunhe Wang (2); 1: State Key Lab of Computer Science, ISCAS & UCAS; 2: Huawei Noah's Ark Lab; 3: University of Macau |
| Pseudocode | No | The paper provides architectural descriptions, mathematical formulas for components like MSA, MLP, and LN, and an illustration in Figure 1, but it does not contain a dedicated pseudocode or algorithm block. (A hedged structural sketch of the described block is given after this table.) |
| Open Source Code | Yes | The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT. |
| Open Datasets | Yes | ImageNet ILSVRC 2012 [26] is an image classification benchmark consisting of 1.2M training images belonging to 1000 classes, and 50K validation images with 50 images per class. ... The details of used visual datasets are listed in Table 2. ... For the license of ImageNet dataset, please refer to http://www.image-net.org/download. ... For the licenses of these datasets, please refer to the original papers. |
| Dataset Splits | Yes | ImageNet ILSVRC 2012 [26] is an image classification benchmark consisting of 1.2M training images belonging to 1000 classes, and 50K validation images with 50 images per class. ... The details of used visual datasets are listed in Table 2. |
| Hardware Specification | Yes | All the models are implemented with PyTorch [24] and MindSpore [15] and trained on NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions software such as PyTorch [24] and MindSpore [15] but does not specify their versions, nor the versions of any other key libraries or dependencies. |
| Experiment Setup | Yes | We utilize the training strategy provided in DeiT [31]. The main advanced technologies apart from common settings [12] include AdamW [20], label smoothing [27], DropPath [18], and repeated augmentation [14]. We list the hyper-parameters in Table 3 for better understanding. ... Table 3 (default training hyper-parameters used in our method, unless stated otherwise): Epochs 300; Optimizer AdamW; Batch size 1024; Learning rate 1e-3; LR decay cosine; Weight decay 0.05; Warmup epochs 5; Label smoothing 0.1; Drop path 0.1; Repeated Aug used. (A configuration sketch of these settings follows the table.) |
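
Since the paper provides no pseudocode, the following is a minimal, hypothetical PyTorch sketch of the block structure it describes: an inner transformer updates pixel-level tokens inside each patch, the flattened result is projected and added to that patch's outer token, and an outer transformer then updates the patch-level tokens. Module names, dimensions, and the use of `nn.MultiheadAttention` are assumptions made for illustration; this is not the authors' released implementation (see the linked repositories for that).

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Two-layer feed-forward block: Linear -> GELU -> Linear."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


class TNTBlock(nn.Module):
    """Hypothetical sketch of one TNT block (not the authors' code):
    inner transformer over pixel-level tokens, projection of the
    flattened inner tokens onto the patch embedding, then an outer
    transformer over patch-level tokens. Pre-LayerNorm throughout."""
    def __init__(self, outer_dim, inner_dim, num_pixels, num_heads=4):
        super().__init__()
        # inner (pixel-level, "word") transformer
        self.inner_norm1 = nn.LayerNorm(inner_dim)
        self.inner_attn = nn.MultiheadAttention(inner_dim, num_heads, batch_first=True)
        self.inner_norm2 = nn.LayerNorm(inner_dim)
        self.inner_mlp = MLP(inner_dim, 4 * inner_dim)
        # projection of flattened inner tokens onto the outer embedding
        self.proj = nn.Linear(num_pixels * inner_dim, outer_dim)
        # outer (patch-level, "sentence") transformer
        self.outer_norm1 = nn.LayerNorm(outer_dim)
        self.outer_attn = nn.MultiheadAttention(outer_dim, num_heads, batch_first=True)
        self.outer_norm2 = nn.LayerNorm(outer_dim)
        self.outer_mlp = MLP(outer_dim, 4 * outer_dim)

    def forward(self, inner_tokens, outer_tokens):
        # inner_tokens: (batch * n_patches, num_pixels, inner_dim)
        # outer_tokens: (batch, n_patches + 1, outer_dim), index 0 is the class token
        y = self.inner_norm1(inner_tokens)
        inner_tokens = inner_tokens + self.inner_attn(y, y, y)[0]
        inner_tokens = inner_tokens + self.inner_mlp(self.inner_norm2(inner_tokens))

        batch, n_plus_1, _ = outer_tokens.shape
        # fuse the flattened inner features into the patch (non-class) outer tokens
        fused = self.proj(inner_tokens.reshape(batch, n_plus_1 - 1, -1))
        outer_tokens = torch.cat([outer_tokens[:, :1],
                                  outer_tokens[:, 1:] + fused], dim=1)

        z = self.outer_norm1(outer_tokens)
        outer_tokens = outer_tokens + self.outer_attn(z, z, z)[0]
        outer_tokens = outer_tokens + self.outer_mlp(self.outer_norm2(outer_tokens))
        return inner_tokens, outer_tokens
```

The class token carries no inner tokens in this sketch, so the fused features are added only to the patch tokens; the released code may differ in such details.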
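Likewise, the Table 3 hyper-parameters quoted in the Experiment Setup row can be collected into a single configuration dictionary. The key names below are assumptions in a DeiT/timm style, not the authors' exact flags; only the values come from the quoted table and text.

```python
# Hypothetical training configuration mirroring Table 3 of the paper.
tnt_train_config = {
    "epochs": 300,
    "optimizer": "adamw",            # AdamW [20]
    "batch_size": 1024,
    "learning_rate": 1e-3,
    "lr_schedule": "cosine",
    "weight_decay": 0.05,
    "warmup_epochs": 5,
    "label_smoothing": 0.1,          # label smoothing [27]
    "drop_path_rate": 0.1,           # DropPath / stochastic depth [18]
    "repeated_augmentation": True,   # listed among the techniques used [14]
}
```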