Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Flexible Nadaraya-Watson Head Can Offer Explainable and Calibrated Classification
Authors: Alan Q. Wang, Mert R. Sabuncu
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results on an array of computer vision tasks demonstrate that the NW head can yield better calibration with comparable accuracy compared to its parametric counterpart, particularly in data-limited settings. To further increase inference-time efficiency, we propose a simple approach that involves a clustering step run on the training set to create a relatively small distilled support set. Furthermore, we explore two means of interpretability/explainability that fall naturally from the NW head. The first is the label weights, and the second is our novel concept of the support influence function, which is an easy-to-compute metric that quantifies the influence of a support element on the prediction for a given query. As we demonstrate in our experiments, the influence function can allow the user to debug a trained model. |
| Researcher Affiliation | Academia | Alan Q. Wang, School of Electrical and Computer Engineering, Cornell University; Mert R. Sabuncu, School of Electrical and Computer Engineering, Cornell University |
| Pseudocode | No | The paper describes methods and derivations using mathematical equations (e.g., Eq. 3, 4, 5, 6, 7, 8) but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/alanqrwang/nwhead. |
| Open Datasets | Yes | Datasets. We experiment with an array of computer vision datasets with different class diversity and training set size. For general image classification, we experiment with Cifar-100 (Krizhevsky, 2009). For fine-grained image classification with small to medium training set size (less than 12k training samples), we experiment with CUB-200-2011 (Bird-200) (Wah et al., 2011), Stanford Dogs (Dog-120) (Khosla et al., 2011), Oxford Flowers (Flower-102) (Nilsback & Zisserman, 2008), and FGVC-Aircraft (Aircraft-100) (Maji et al., 2013). For a large-scale fine-grained image classification task (500k training samples), we experiment with iNaturalist-10k (Van Horn et al., 2021). Additional dataset details are provided in Appendix A.1. |
| Dataset Splits | Yes | For Cifar-100, Flower-102, Aircraft-100, and iNaturalist-10k, we use the implementation in the torchvision package. For iNaturalist-10k, we use the "mini" training dataset from the 2021 competition, which has balanced images per class, and test on the validation set (the test set is not provided). For Bird-200 and Dog-120, we pull the train/test splits from the dataset websites. |
| Hardware Specification | Yes | All training and inference is done on an Nvidia A6000 GPU and all code is written in PyTorch. |
| Software Dependencies | No | The paper names PyTorch as its framework: "All training and inference is done on an Nvidia A6000 GPU and all code is written in PyTorch.", but it does not specify a version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | For NW training, we use SGD with momentum 0.9, weight decay 1e-4, and an initial learning rate of 1e-3. The learning rate is divided by 10 after 20K and 30K gradient steps, and training stops after 40K gradient steps. For all datasets except iNaturalist-10k, we set the randomly sampled support size to Ns = 10, and the mini-batch size to Nb = 32 for Cifar-100 and Nb = 4 for Bird-200, Dog-120, Flower-102, and Aircraft-100. That is, we sample a unique support set for each query, following Eq. 6. For iNaturalist-10k, we use a mini-batch size Nb = 128 and support size Ns = 250, and sample a single support set for each mini-batch (instead of each query) for computational efficiency. Effectively, this makes the total number of images per mini-batch Ns + Nb instead of Ns × Nb. Note that this change has very little effect on training because the number of unique pairwise similarities is preserved. |
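The NW head and support influence function quoted in the Research Type row can be illustrated with a short sketch. The following is a hypothetical, minimal illustration, not the authors' implementation: it predicts by kernel-weighting one-hot support labels (a Gaussian kernel on squared Euclidean distance is an assumption; the paper's similarity may differ), and it approximates a support element's influence with a leave-one-out difference in the predicted probability, a hedged stand-in for the paper's support influence function.

```python
import numpy as np

def nw_predict(query, support_x, support_y, tau=1.0):
    """Nadaraya-Watson head (sketch): similarity-weighted average of
    one-hot support labels. The Gaussian kernel on squared Euclidean
    distance is an assumption, not necessarily the paper's choice."""
    d2 = np.sum((support_x - query) ** 2, axis=1)
    w = np.exp(-d2 / tau)
    w = w / w.sum()              # kernel weights over the support set
    return w @ support_y         # class-probability vector

def loo_influence(query, support_x, support_y, label_idx, tau=1.0):
    """Leave-one-out influence (illustrative stand-in for the paper's
    support influence function): the change in the predicted probability
    of class `label_idx` when each support element is removed."""
    full = nw_predict(query, support_x, support_y, tau)[label_idx]
    n = len(support_x)
    infl = []
    for j in range(n):
        keep = np.arange(n) != j
        loo = nw_predict(query, support_x[keep], support_y[keep], tau)[label_idx]
        infl.append(full - loo)  # positive: this support point helps the class
    return np.array(infl)

# toy example: 4 support points, 2 classes, 2-d features
xs = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
ys = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
q = np.array([0.2, 0.2])
probs = nw_predict(q, xs, ys)                  # heavily favors class 0
infl = loo_influence(q, xs, ys, label_idx=0)   # class-0 neighbors help, class-1 hurt
```

Positive influence values mark support points that pull the query toward the chosen class; large negative values flag points (e.g. mislabeled ones) that fight the prediction, which is the kind of model-debugging use the paper describes.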
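The optimization schedule in the Experiment Setup row (initial learning rate 1e-3, divided by 10 after 20K and 30K gradient steps, stopping at 40K) is a standard step decay. A minimal sketch, using only the milestone values quoted above:

```python
def step_decay_lr(step, base_lr=1e-3, milestones=(20_000, 30_000), gamma=0.1):
    """Step-decay schedule matching the quoted setup: the learning rate
    starts at base_lr and is multiplied by gamma (i.e. divided by 10)
    at each milestone in gradient steps."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[20000, 30000]` and `gamma=0.1` stepped once per gradient step, though the paper does not state which scheduler implementation was used.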