Analyzing Vision Transformers for Image Classification in Class Embedding Space

Authors: Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach using a variety of ViTs that differ in their patch size, layer depth, training dataset, and use of architectural modifications. Specifically, we separately probed: 1) a vanilla ViT with 12 blocks and a patch size of 16 pre-trained using the ImageNet-21k dataset and fine-tuned on ImageNet-1k (ViT-B/16) [9]; 2) a variant with a bigger patch size of 32 (ViT-B/32) [9]; 3) a deeper version with 24 blocks (ViT-L/16) [9]; 4) a variant fine-tuned on the CIFAR100 dataset [14]; 5) a version pre-trained on an alternative ImageNet-21K dataset with higher-quality semantic labels (MIIL) [20]; 6) a modified architecture with a refinement module that aligns the intermediate representations of all tokens to class space (Refinement) [15]; and 7) an alternate version trained with Global Average Pooling (GAP) instead of the [CLS] token. (A loading sketch for these variants follows the table.)
Researcher Affiliation | Academia | Martina G. Vilas (Goethe University Frankfurt; Ernst Strüngmann Institute), Timothy Schaumlöffel (Goethe University Frankfurt; The Hessian Center for AI), Gemma Roig (Goethe University Frankfurt; The Hessian Center for AI)
Pseudocode | No | The paper includes mathematical equations and descriptive text for its methods, but no explicitly labeled 'Pseudocode' or 'Algorithm' block, nor any code-like formatted steps.
Open Source Code | Yes | Our code is available at https://github.com/martinagvilas/vit-cls_emb
Open Datasets | Yes | Specifically, we separately probed: 1) a vanilla ViT with 12 blocks and a patch size of 16 pre-trained using the ImageNet-21k dataset and fine-tuned on ImageNet-1k (ViT-B/16) [9]; 4) a variant fine-tuned on the CIFAR100 dataset [14]; and used the ImageNet-S dataset [10], which consists of a sub-selection of images from ImageNet accompanied by semantic segmentation annotations.
Dataset Splits | No | We analyzed the representations of 5 randomly sampled images of every class from the validation set. The paper mentions using a 'validation set' and 'training dataset' but does not specify the explicit proportions or methodology for creating the dataset splits. (A sampling sketch follows the table.)
Hardware Specification | No | We are grateful for access to the computing facilities of the Center for Scientific Computing at Goethe University, and of the Ernst Strüngmann Institute for Neuroscience. This statement refers to general computing facilities without providing specific hardware details such as GPU/CPU models or memory specifications.
Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | No | The paper specifies the Vision Transformer models used, including their patch size and layer depth, and mentions fine-tuning. However, it does not explicitly provide concrete hyperparameter values such as learning rates, batch sizes, number of epochs, or optimizer settings for training.
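As an orientation for reproduction, the sketch below loads one of the backbones listed in the Research Type row and reads its intermediate [CLS] representations in class embedding space by reusing the final classification head. This is a minimal sketch of the general idea only: the timm checkpoint identifier, the hook placement, and the projection through model.norm and model.head are assumptions, not the authors' released code; https://github.com/martinagvilas/vit-cls_emb remains the authoritative implementation.

```python
# Sketch only: the checkpoint identifier and the projection-through-the-head
# step are assumptions about the general approach, not the authors' exact code.
import timm
import torch

# ViT-B/16 pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

cls_states = []  # one [CLS] representation per transformer block

def grab_cls(_module, _inputs, output):
    cls_states.append(output[:, 0])  # token 0 is [CLS] in timm ViTs

handles = [blk.register_forward_hook(grab_cls) for blk in model.blocks]

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed ImageNet image
with torch.no_grad():
    model(image)

for depth, state in enumerate(cls_states):
    # Read the intermediate state in class embedding space by reusing the
    # final LayerNorm and classification head.
    logits = model.head(model.norm(state))
    top5 = logits.topk(5).indices.squeeze().tolist()
    print(f"block {depth:02d}: top-5 class indices {top5}")

for handle in handles:
    handle.remove()
```

For the data side, the Dataset Splits row quotes sampling 5 images of every class from the validation set. One straightforward way to build such a subset, assuming an ImageFolder-style layout of the ImageNet validation split (the paper does not report the exact selection procedure or a random seed), is:

```python
# Sketch of the per-class subsampling quoted above, assuming an ImageFolder-style
# validation directory; the paper's exact sampling procedure and seed are unknown.
import random
from collections import defaultdict

from torchvision.datasets import ImageFolder

random.seed(0)  # arbitrary seed, not reported in the paper

val_set = ImageFolder("/path/to/imagenet/val")  # placeholder path

indices_by_class = defaultdict(list)
for idx, (_path, label) in enumerate(val_set.samples):
    indices_by_class[label].append(idx)

subset = [
    idx
    for class_indices in indices_by_class.values()
    for idx in random.sample(class_indices, k=min(5, len(class_indices)))
]
print(f"selected {len(subset)} images across {len(indices_by_class)} classes")
```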