Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding
Authors: Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan Ö. Arık, Tomas Pfister
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments. We first show the benefit of NesT for data-efficient learning and then demonstrate benefits for interpretability and generative modeling. Finally, we present ablation studies to analyze the major constituents of the methods. Experimental setup. Table 1: Test accuracy on CIFAR with input size 32×32. Table 2: Comparison on the ImageNet dataset. Table 3: Comparison on ImageNet benchmark with ImageNet-22K pre-training. |
| Researcher Affiliation | Industry | Zizhao Zhang1, Han Zhang2, Long Zhao2, Ting Chen2, Sercan O. Arık1, Tomas Pfister1 1Google Cloud AI 2Google Research {zizhaoz,zhanghan,longzh,iamtingchen,soarik,tpfister}@google.com |
| Pseudocode | Yes | Figure 1: (Left) Illustration of NesT with nested transformer hierarchy; (right) the simple pseudocode to generate the architecture. Algorithm 1: GradCAT. (A minimal illustrative sketch of this nested hierarchy is given after the table.) |
| Open Source Code | Yes | Source code is available at https://github.com/google-research/nested-transformer. |
| Open Datasets | Yes | NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR. Table 1: Test accuracy on CIFAR with input size 32×32. The compared convolutional architectures are optimized models for CIFAR. We test NesT on standard ImageNet 2012 benchmarks (Deng et al. 2009). |
| Dataset Splits | Yes | Most recent ViT-based methods follow the training techniques of DeiT (Touvron et al. 2021a). We follow the settings with minor modifications that we find useful for local self-attention (see Appendix for all architecture and training details). We test NesT on standard ImageNet 2012 benchmarks (Deng et al. 2009) with commonly used 300-epoch training on TPUs in Table 2. The pre-training is 90 epochs on 224×224 ImageNet-21K images and finetuning is 30 epochs on 384×384 ImageNet images. |
| Hardware Specification | No | The paper mentions training 'using a single GPU' and 'on TPUs' but does not provide specific models or detailed hardware specifications such as GPU models (e.g., NVIDIA A100, Tesla V100) or TPU versions (e.g., TPU v2, v3). |
| Software Dependencies | No | The paper states that training techniques follow DeiT (Touvron et al. 2021a) and mentions details in the Appendix, but no specific software dependencies with version numbers (e.g., PyTorch version, CUDA version) are provided in the main body of the text. |
| Experiment Setup | Yes | Experimental setup. We follow previous work (Dosovitskiy et al. 2021) to generate three architectures that have comparable capacity (in number of parameters and FLOPS), noted as tiny (NesT-T), small (NesT-S), and base (NesT-B). Most recent ViT-based methods follow the training techniques of DeiT (Touvron et al. 2021a). We follow the settings with minor modifications that we find useful for local self-attention (see Appendix for all architecture and training details). We test NesT on standard ImageNet 2012 benchmarks (Deng et al. 2009) with commonly used 300-epoch training on TPUs in Table 2. The input size is 224×224 and no extra pre-training data is used. The pre-training is 90 epochs on 224×224 ImageNet-21K images and finetuning is 30 epochs on 384×384 ImageNet images. (A hypothetical configuration summary of these schedules is sketched after the table.) |
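
The paper's Figure 1 (right) shows simple pseudocode for generating the nested architecture, but the excerpt above does not reproduce it. Below is a minimal, runnable sketch of the nested hierarchy as described: partition the feature map into non-overlapping blocks, apply shared transformer layers to each block independently, then aggregate blocks between hierarchy levels. The helper names (`blockify`, `unblockify`, `transformer_block`, `aggregate`) and the identity/max-pool stand-ins are assumptions for illustration, not the authors' implementation (in the paper, aggregation involves convolution and pooling, and the blocks are processed by real self-attention layers).

```python
# Minimal sketch of the NesT nested hierarchy (hypothetical helper names;
# not the authors' implementation).
import numpy as np

def blockify(x, block_size):
    """Partition a (H, W, C) feature map into non-overlapping blocks.

    Returns an array of shape (num_blocks, block_size * block_size, C),
    i.e. a token sequence per block.
    """
    H, W, C = x.shape
    b = block_size
    x = x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, b * b, C)

def unblockify(tokens, height, width, block_size):
    """Inverse of blockify: reassemble blocks into a (H, W, C) feature map."""
    b = block_size
    C = tokens.shape[-1]
    x = tokens.reshape(height // b, width // b, b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(height, width, C)

def transformer_block(tokens):
    """Stand-in for a stack of local self-attention + MLP layers.

    NesT applies standard transformer layers independently to each block of
    tokens; here we only keep the shape contract (identity placeholder).
    """
    return tokens

def aggregate(x):
    """Stand-in for block aggregation between hierarchy levels.

    The paper merges blocks on the image plane (conv + pooling); this
    placeholder simply halves spatial resolution with 2x2 max pooling.
    """
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def nest_forward(image_feats, num_levels=3, block_size=4):
    """Nested hierarchy: blockify -> per-block transformer -> aggregate."""
    x = image_feats
    for level in range(num_levels):
        H, W, _ = x.shape
        tokens = blockify(x, block_size)           # independent local blocks
        tokens = transformer_block(tokens)         # shared layers per block
        x = unblockify(tokens, H, W, block_size)   # back to a feature map
        if level < num_levels - 1:
            x = aggregate(x)                       # merge blocks between levels
    return x

if __name__ == "__main__":
    feats = np.random.rand(32, 32, 8).astype(np.float32)
    print(nest_forward(feats).shape)  # (8, 8, 8) after two aggregation steps
```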
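
For quick reference, the training schedules quoted in the Experiment Setup row can be written as a small configuration sketch. Only the values stated in the excerpts (datasets, input resolutions, epoch counts) come from the paper; the dictionary field names are hypothetical and do not correspond to the released code.

```python
# Hypothetical summary of the training schedules quoted above; field names
# are illustrative, values are taken from the paper excerpts.
IMAGENET_FROM_SCRATCH = {
    "dataset": "ImageNet-2012",
    "input_size": (224, 224),
    "epochs": 300,             # "commonly used 300 epoch training on TPUs"
    "extra_pretraining_data": None,
}

IMAGENET_22K_TRANSFER = {
    "pretraining": {
        "dataset": "ImageNet-21K",
        "input_size": (224, 224),
        "epochs": 90,
    },
    "finetuning": {
        "dataset": "ImageNet-2012",
        "input_size": (384, 384),
        "epochs": 30,
    },
}
```

The remaining hyperparameters (optimizer, augmentation, regularization) follow the DeiT recipe with minor modifications and are listed in the paper's Appendix rather than in the excerpts above.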