Analyzing Vision Transformers for Image Classification in Class Embedding Space
Authors: Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach using a variety of ViTs that differ in their patch size, layer depth, training dataset, and use of architectural modifications. Specifically, we separately probed: 1) a vanilla ViT with 12 blocks and a patch size of 16 pre-trained using the ImageNet-21k dataset and fine-tuned on ImageNet-1k (ViT-B/16) [9]; 2) a variant with a bigger patch size of 32 (ViT-B/32) [9]; 3) a deeper version with 24 blocks (ViT-L/16) [9]; 4) a variant fine-tuned on the CIFAR100 dataset [14]; 5) a version pre-trained on an alternative ImageNet-21K dataset with higher-quality semantic labels (MIIL) [20]; 6) a modified architecture with a refinement module that aligns the intermediate representations of all tokens to class space (Refinement) [15]; and 7) an alternate version trained with Global Average Pooling (GAP) instead of the [CLS] token. (See the model-loading sketch after this table.) |
| Researcher Affiliation | Academia | Martina G. Vilas (Goethe University Frankfurt; Ernst Strüngmann Institute), Timothy Schaumlöffel (Goethe University Frankfurt; The Hessian Center for AI), Gemma Roig (Goethe University Frankfurt; The Hessian Center for AI) |
| Pseudocode | No | The paper includes mathematical equations and descriptive text for its methods, but no explicitly labeled 'Pseudocode' or 'Algorithm' block, and no code-formatted steps. (A hedged sketch of the class-space projection the paper describes appears after this table.) |
| Open Source Code | Yes | Our code is available at https://github.com/martinagvilas/vit-cls_emb |
| Open Datasets | Yes | Specifically, we separately probed: 1) a vanilla ViT with 12 blocks and a patch size of 16 pre-trained using the ImageNet-21k dataset and fine-tuned on ImageNet-1k (ViT-B/16) [9]; 4) a variant fine-tuned on the CIFAR100 dataset [14]; and used the ImageNet-S dataset [10], which consists of a sub-selection of images from ImageNet accompanied by semantic segmentation annotations. |
| Dataset Splits | No | We analyzed the representations of 5 randomly sampled images of every class from the validation set. The paper refers to a 'validation set' and a 'training dataset' but does not specify explicit split proportions or the methodology used to create them. (A sketch of the described per-class sampling appears after this table.) |
| Hardware Specification | No | We are grateful for access to the computing facilities of the Center for Scientific Computing at Goethe University, and of the Ernst Strüngmann Institute for Neuroscience. This statement refers to general computing facilities without providing specific hardware details such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | No | The paper specifies the Vision Transformer models used, including their patch size and layer depth, and mentions fine-tuning. However, it does not explicitly provide concrete hyperparameter values such as learning rates, batch sizes, number of epochs, or optimizer settings for training. |
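
The probed variants map naturally onto off-the-shelf checkpoints. Below is a minimal sketch of loading several of them with the `timm` library; the specific timm model identifiers are assumptions on our part (the paper does not list them), and the authors' actual setup lives in the linked repository (https://github.com/martinagvilas/vit-cls_emb).

```python
# Sketch only: timm model identifiers below are assumed, not taken from
# the paper or the authors' repository.
import timm

ASSUMED_MODEL_IDS = {
    "ViT-B/16": "vit_base_patch16_224",             # 12 blocks, patch size 16
    "ViT-B/32": "vit_base_patch32_224",             # bigger patch size
    "ViT-L/16": "vit_large_patch16_224",            # deeper, 24 blocks
    "ViT-B/16 (MIIL)": "vit_base_patch16_224_miil", # higher-quality 21K labels
}

models = {name: timm.create_model(model_id, pretrained=True).eval()
          for name, model_id in ASSUMED_MODEL_IDS.items()}
```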
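Since the paper gives equations rather than pseudocode, the following hedged sketch illustrates the kind of analysis it describes: projecting each block's intermediate token representations into class embedding space by applying the model's frozen final norm and classifier head. The attribute names (`patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`, `head`) assume a timm-style ViT and are not taken from the authors' code.

```python
import torch

@torch.no_grad()
def class_space_projections(model, pixels):
    """Project every block's token representations into class space.

    A hedged sketch of the logit-lens-style analysis the paper describes,
    not the authors' implementation. Assumes a timm-style ViT exposing
    `patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`, and `head`.
    """
    x = model.patch_embed(pixels)                      # (B, N, D) patch tokens
    cls = model.cls_token.expand(x.shape[0], -1, -1)   # learnable [CLS] token
    x = torch.cat([cls, x], dim=1) + model.pos_embed   # prepend + add positions
    per_block_logits = []
    for block in model.blocks:
        x = block(x)
        # Apply the frozen final norm and classifier head to *all* tokens of
        # this intermediate representation, not just [CLS], yielding
        # per-token class scores at this depth.
        per_block_logits.append(model.head(model.norm(x)))
    return per_block_logits                            # list of (B, N+1, C)
```

A call like `class_space_projections(models["ViT-B/16"], images)` would then return one class-score tensor per block, letting per-token class identifiability be tracked across depth.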
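The reported sampling ("5 randomly sampled images of every class from the validation set") is straightforward to reproduce in spirit. The sketch below assumes a torchvision `ImageFolder` directory layout and an arbitrary seed, neither of which the paper specifies.

```python
import random
from collections import defaultdict
from torchvision.datasets import ImageFolder

# Sketch of the described sampling: 5 random validation images per class.
# The dataset path and the seed are illustrative assumptions.
val = ImageFolder("/path/to/imagenet/val")
rng = random.Random(0)

by_class = defaultdict(list)
for idx, (_, label) in enumerate(val.samples):
    by_class[label].append(idx)

selected = [idx for indices in by_class.values()
            for idx in rng.sample(indices, k=5)]
```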