Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding
Authors: Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan Ö. Arık, Tomas Pfister
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments. We first show the benefit of NesT for data-efficient learning and then demonstrate benefits for interpretability and generative modeling. Finally, we present ablation studies to analyze the major constituents of the methods. Experimental setup. Table 1: Test accuracy on CIFAR with input size 32×32. Table 2: Comparison on the ImageNet dataset. Table 3: Comparison on ImageNet benchmark with ImageNet-22K pre-training. |
| Researcher Affiliation | Industry | Zizhao Zhang1, Han Zhang2, Long Zhao2, Ting Chen2, Sercan O. Arık1, Tomas Pfister1 1Google Cloud AI 2Google Research {zizhaoz,zhanghan,longzh,iamtingchen,soarik,tpfister}@google.com |
| Pseudocode | Yes | Figure 1: (Left) Illustration of NesT with nested transformer hierarchy; (right) the simple pseudocode to generate the architecture. Algorithm 1: GradCAT. (A minimal illustrative sketch of this nested hierarchy is given after the table.) |
| Open Source Code | Yes | Source code is available at https://github.com/google-research/nested-transformer. |
| Open Datasets | Yes | NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR. Table 1: Test accuracy on CIFAR with input size 32×32. The compared convolutional architectures are optimized models for CIFAR. We test NesT on standard ImageNet 2012 benchmarks (Deng et al. 2009). |
| Dataset Splits | Yes | Most recent ViT-based methods follow the training techniques of DeiT (Touvron et al. 2021a). We follow the settings with minor modifications that we find useful for local self-attention (see Appendix for all architecture and training details). We test NesT on standard ImageNet 2012 benchmarks (Deng et al. 2009) with commonly used 300-epoch training on TPUs in Table 2. The pre-training is 90 epochs on 224×224 ImageNet-21K images and finetuning is 30 epochs on 384×384 ImageNet images. |
| Hardware Specification | No | The paper mentions training 'using a single GPU' and 'on TPUs' but does not provide specific models or detailed hardware specifications such as GPU models (e.g., NVIDIA A100, Tesla V100) or TPU versions (e.g., TPU v2, v3). |
| Software Dependencies | No | The paper states that training techniques follow DeiT (Touvron et al. 2021a) and mentions details in the Appendix, but no specific software dependencies with version numbers (e.g., PyTorch version, CUDA version) are provided in the main body of the text. |
| Experiment Setup | Yes | Experimental setup. We follow previous work (Dosovitskiy et al. 2021) to generate three architectures that have comparable capacity (in number of parameters and FLOPS), noted as tiny (NesT-T), small (NesT-S), and base (NesT-B). Most recent ViT-based methods follow the training techniques of DeiT (Touvron et al. 2021a). We follow the settings with minor modifications that we find useful for local self-attention (see Appendix for all architecture and training details). We test NesT on standard ImageNet 2012 benchmarks (Deng et al. 2009) with commonly used 300-epoch training on TPUs in Table 2. The input size is 224×224 and no extra pre-training data is used. The pre-training is 90 epochs on 224×224 ImageNet-21K images and finetuning is 30 epochs on 384×384 ImageNet images. (A hypothetical configuration summary of these schedules is sketched after the table.) |
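
The paper's Figure 1 (right) shows simple pseudocode for generating the nested architecture, but the excerpt above does not reproduce it. Below is a minimal, runnable sketch of the nested hierarchy as described: partition the feature map into non-overlapping blocks, apply shared transformer layers to each block independently, then aggregate blocks between hierarchy levels. The helper names (`blockify`, `unblockify`, `transformer_block`, `aggregate`) and the identity/max-pool stand-ins are assumptions for illustration, not the authors' implementation (in the paper, aggregation involves convolution and pooling, and the blocks are processed by real self-attention layers).

```python
# Minimal sketch of the NesT nested hierarchy (hypothetical helper names;
# not the authors' implementation).
import numpy as np

def blockify(x, block_size):
    """Partition a (H, W, C) feature map into non-overlapping blocks.

    Returns an array of shape (num_blocks, block_size * block_size, C),
    i.e. a token sequence per block.
    """
    H, W, C = x.shape
    b = block_size
    x = x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, b * b, C)

def unblockify(tokens, height, width, block_size):
    """Inverse of blockify: reassemble blocks into a (H, W, C) feature map."""
    b = block_size
    C = tokens.shape[-1]
    x = tokens.reshape(height // b, width // b, b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(height, width, C)

def transformer_block(tokens):
    """Stand-in for a stack of local self-attention + MLP layers.

    NesT applies standard transformer layers independently to each block of
    tokens; here we only keep the shape contract (identity placeholder).
    """
    return tokens

def aggregate(x):
    """Stand-in for block aggregation between hierarchy levels.

    The paper merges blocks on the image plane (conv + pooling); this
    placeholder simply halves spatial resolution with 2x2 max pooling.
    """
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def nest_forward(image_feats, num_levels=3, block_size=4):
    """Nested hierarchy: blockify -> per-block transformer -> aggregate."""
    x = image_feats
    for level in range(num_levels):
        H, W, _ = x.shape
        tokens = blockify(x, block_size)           # independent local blocks
        tokens = transformer_block(tokens)         # shared layers per block
        x = unblockify(tokens, H, W, block_size)   # back to a feature map
        if level < num_levels - 1:
            x = aggregate(x)                       # merge blocks between levels
    return x

if __name__ == "__main__":
    feats = np.random.rand(32, 32, 8).astype(np.float32)
    print(nest_forward(feats).shape)  # (8, 8, 8) after two aggregation steps
```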
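
For quick reference, the training schedules quoted in the Experiment Setup row can be written as a small configuration sketch. Only the values stated in the excerpts (datasets, input resolutions, epoch counts) come from the paper; the dictionary field names are hypothetical and do not correspond to the released code.

```python
# Hypothetical summary of the training schedules quoted above; field names
# are illustrative, values are taken from the paper excerpts.
IMAGENET_FROM_SCRATCH = {
    "dataset": "ImageNet-2012",
    "input_size": (224, 224),
    "epochs": 300,             # "commonly used 300 epoch training on TPUs"
    "extra_pretraining_data": None,
}

IMAGENET_22K_TRANSFER = {
    "pretraining": {
        "dataset": "ImageNet-21K",
        "input_size": (224, 224),
        "epochs": 90,
    },
    "finetuning": {
        "dataset": "ImageNet-2012",
        "input_size": (384, 384),
        "epochs": 30,
    },
}
```

The remaining hyperparameters (optimizer, augmentation, regularization) follow the DeiT recipe with minor modifications and are listed in the paper's Appendix rather than in the excerpts above.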