ViT-NeT: Interpretable Vision Transformers with Neural Tree Decoder

Authors: Sangwon Kim, Jaeyeal Nam, Byoung Chul Ko

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compared the performance of ViT-NeT with other state-of-the-art methods using widely used fine-grained visual categorization benchmark datasets and experimentally proved that the proposed method is superior in terms of classification performance and interpretability.
Researcher Affiliation | Academia | Sangwon Kim, Jaeyeal Nam, Byoung Chul Ko; Department of Computer Engineering, Keimyung University, Daegu, South Korea.
Pseudocode | Yes | Algorithm 1: Training a ViT-NeT
Open Source Code | Yes | The code and models are publicly available at https://github.com/jumpsnack/ViT-NeT.
Open Datasets | Yes | Datasets: We evaluated our ViT-NeT on three FGVC datasets: CUB-200-2011 (Wah et al., 2011), Stanford Cars (Krause et al., 2013), and Stanford Dogs (Khosla et al., 2011), and compared our model with previous SOTA models in terms of accuracy and interpretability.
Dataset Splits | No | The paper provides training and testing splits for each dataset (e.g., 'CUB-200-2011... 5,994 training images and 5,794 testing images') but does not explicitly mention a validation split.
Hardware Specification | Yes | Training and testing were conducted using four NVIDIA Tesla V100 32GB GPUs with APEX.
Software Dependencies | No | The paper mentions software such as 'PyTorch', 'AdamW optimizer', and 'APEX', but does not provide specific version numbers for these software components.
Experiment Setup | Yes | The learning rate was initialized as 2e-5 for CUB-200-2011, 2e-4 for Stanford Dogs, and 2e-3 for Stanford Cars. The batch size was set to 16.
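
As a rough illustration of how the reported hyperparameters fit together, below is a minimal PyTorch sketch, assuming AdamW with the per-dataset initial learning rates, batch size 16, and standard mixed-precision training. The dummy linear model, random dataset, and `make_training_setup` helper are hypothetical placeholders, not the authors' released ViT-NeT code.

```python
# Minimal, runnable sketch of the reported training configuration.
# The tiny linear model and random TensorDataset stand in for ViT-NeT and the
# FGVC datasets; they are placeholders, not the authors' implementation.
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

# Initial learning rates reported in the paper, per dataset.
LEARNING_RATES = {
    "cub200": 2e-5,         # CUB-200-2011
    "stanford_dogs": 2e-4,  # Stanford Dogs
    "stanford_cars": 2e-3,  # Stanford Cars
}
BATCH_SIZE = 16  # batch size reported in the paper


def make_training_setup(dataset_name: str, num_classes: int = 200):
    # Placeholder model and data; a real run would build ViT-NeT and load
    # the corresponding FGVC dataset here.
    model = nn.Linear(768, num_classes)
    data = TensorDataset(torch.randn(64, 768),
                         torch.randint(0, num_classes, (64,)))
    loader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATES[dataset_name])
    # Native AMP gradient scaler as a stand-in for the APEX mixed-precision
    # setup mentioned in the paper.
    scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
    return model, loader, optimizer, scaler


model, loader, optimizer, scaler = make_training_setup("cub200")
```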