Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ComFe: An Interpretable Head for Vision Transformers
Authors: Evelyn Mannix, Liam Hodgkinson, Howard Bondell
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | ComFe achieves competitive results with comparable non-interpretable approaches, and provides improved performance on a range of ImageNet-1K generalisability and robustness benchmarks. Demonstrate the competitive performance of ComFe on a range of datasets using a consistent set of hyperparameters, in comparison to other interpretable approaches that finetune the backbone networks and tune the hyperparameters for each dataset. Table 1: Interpretable performance comparison. Performance (top-1 accuracy) of ComFe and other interpretable image classification approaches on several benchmarking datasets. Table 3: Generalisation and robustness benchmarks. Performance (top-1 accuracy) of ComFe versus a linear head with frozen features on ImageNet-1K generalisation and robustness benchmarks. |
| Researcher Affiliation | Academia | Evelyn J. Mannix, School of Mathematics and Statistics, University of Melbourne; Liam Hodgkinson, School of Mathematics and Statistics, University of Melbourne; Howard Bondell, School of Mathematics and Statistics, University of Melbourne |
| Pseudocode | Yes | Algorithm 1: Algorithm for training ComFe. Input: training set T, number of epochs NE, backbone model f, augmentation strategy Aug(.) that creates two augmentations per image with the same cropping and flipping operations. Randomly initialise the transformer decoder head gθ, input queries Q and class prototypes C, and generate the class assignment matrix ϕ; i = 0; while i < NE do: randomly split T into B mini-batches; for (xb, yb) ∈ {T1, ..., Tb, ..., TB} do: X = Aug(xb); ν = OneHot(yb); if using background class prototypes then ν = {ν, [1, ..., 1, ..., 1]} ▷ add the background class to all images; Z = f(X); P = gθ(Z, Q); |
| Open Source Code | Yes | Code is available at github.com/emannix/comfe-component-features. |
| Open Datasets | Yes | Fine-grained image benchmarking datasets including Oxford Pets (37 classes) (Parkhi et al., 2012), FGVC Aircraft (100 classes) (Maji et al., 2013), Stanford Cars (196 classes) (Krause et al., 2013) and CUB200 (200 classes) (Wah et al., 2011) have all previously been used to benchmark interpretable computer vision models. These are used to test the performance of ComFe, in addition to other datasets including ImageNet-1K (1000 classes) (Russakovsky et al., 2015), CIFAR-10 (10 classes), CIFAR-100 (100 classes) (Krizhevsky et al., 2009), Flowers-102 (102 classes) (Nilsback & Zisserman, 2008) and Food-101 (101 classes) (Bossard et al., 2014). |
| Dataset Splits | Yes | Fine-grained image benchmarking datasets including Oxford Pets (37 classes) (Parkhi et al., 2012), FGVC Aircraft (100 classes) (Maji et al., 2013), Stanford Cars (196 classes) (Krause et al., 2013) and CUB200 (200 classes) (Wah et al., 2011) have all previously been used to benchmark interpretable computer vision models. These are used to test the performance of ComFe, in addition to other datasets including ImageNet-1K (1000 classes) (Russakovsky et al., 2015), CIFAR-10 (10 classes), CIFAR-100 (100 classes) (Krizhevsky et al., 2009), Flowers-102 (102 classes) (Nilsback & Zisserman, 2008) and Food-101 (101 classes) (Bossard et al., 2014). |
| Hardware Specification | Yes | For the CUB200 dataset, ComFe can train a ViT-L/14 model in thirty minutes on one 80GB NVIDIA A100 GPU, and has a sufficiently small memory footprint that it could be trained on most mid-range consumer graphics cards. This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This research was also undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. |
| Software Dependencies | No | Standard choices for training transformers are used, such as the AdamW optimiser (Loshchilov & Hutter, 2017b), cosine learning rate decay with linear warmup (Loshchilov & Hutter, 2017a; Gotmare et al., 2019) and gradient clipping (Pascanu et al., 2013). For image augmentations, we follow DINOv2 and other works, including random cropping, flipping, color distortion and random greyscale (Chen et al., 2020a; Oquab et al., 2024). |
| Experiment Setup | Yes | The same set of hyperparameters is used across all of the training runs, with the exception of the batch size, which is increased from 64 images to 1024 for the ImageNet dataset only. The number of epochs for ImageNet (Russakovsky et al., 2015) is also reduced, from 50 epochs per training run to 20 epochs. All results use an input resolution of 224×224 pixels, with outputs upsampled using bilinear interpolation to generate pixel-level results from patch-level predictions. For each dataset, a total of five image prototypes P (giving Q five rows) and 6c class prototypes C are used, where c is the number of classes in the dataset. Similar temperature parameters are used to previous works (Assran et al., 2021), with τ1 = 0.1, τ2 = 0.02 and τc = 0.02. An ablation study on these hyperparameters is undertaken in Section D of the supporting information. |
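The flattened pseudocode quoted under the Pseudocode variable (Algorithm 1) can be sketched as a minimal Python loop body. This is a hedged reconstruction, not the authors' implementation: `augment`, `backbone` and `head` are NumPy stand-ins for Aug(.), the frozen backbone f and the transformer decoder head gθ, introduced here only to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    # Stand-in for Aug(.): two views per image that would share the same
    # cropping/flipping operations; here simply two noisy copies.
    return np.stack([x + rng.normal(scale=0.01, size=x.shape),
                     x + rng.normal(scale=0.01, size=x.shape)])

def one_hot(y, num_classes):
    out = np.zeros((len(y), num_classes))
    out[np.arange(len(y)), y] = 1.0
    return out

def backbone(x):
    # Stand-in for the frozen backbone f: flatten each image to a feature vector.
    return x.reshape(x.shape[0], -1)

def head(z, queries):
    # Stand-in for g_theta(Z, Q): mixes features against the input queries.
    return z @ np.ones((z.shape[-1], queries.shape[0])) / z.shape[-1]

num_classes = 3
use_background = True
queries = rng.normal(size=(5, 8))      # five image prototypes, per the setup
x_batch = rng.normal(size=(4, 2, 2))   # toy mini-batch of "images"
y_batch = np.array([0, 1, 2, 1])

# One iteration of the inner loop of Algorithm 1:
X = augment(x_batch)                                 # two views per image
nu = one_hot(y_batch, num_classes)
if use_background:
    # Add the background class to all images.
    nu = np.concatenate([nu, np.ones((len(y_batch), 1))], axis=1)
Z = backbone(X.reshape(-1, *X.shape[2:]))            # features for both views
P = head(Z, queries)                                 # per-view prototype scores
```

The quoted pseudocode is truncated after `P = gθ(Z, Q)`, so the loss and update steps are deliberately omitted here as well.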
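The Software Dependencies row names cosine learning rate decay with linear warmup but gives no schedule constants. The sketch below shows the generic recipe; `base_lr`, `warmup_steps` and `total_steps` are illustrative values, not the paper's.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        # ramp linearly up to base_lr over the warmup period
        return base_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000, base_lr=1e-3, warmup_steps=100)
            for s in range(1000)]
```

In practice this is usually wrapped in the training framework's scheduler rather than computed by hand; the closed form is shown only to make the shape of the schedule explicit.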