Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Flexible Nadaraya-Watson Head Can Offer Explainable and Calibrated Classification

Authors: Alan Q. Wang, Mert R. Sabuncu

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical results on an array of computer vision tasks demonstrate that the NW head can yield better calibration with comparable accuracy compared to its parametric counterpart, particularly in data-limited settings. To further increase inference-time efficiency, we propose a simple approach that involves a clustering step run on the training set to create a relatively small distilled support set. Furthermore, we explore two means of interpretability/explainability that fall naturally from the NW head. The first is the label weights, and the second is our novel concept of the support influence function, which is an easy-to-compute metric that quantifies the influence of a support element on the prediction for a given query. As we demonstrate in our experiments, the influence function can allow the user to debug a trained model.
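The NW head and label weights described above can be illustrated with a minimal sketch: the prediction is a similarity-weighted average of support-set label one-hots. The kernel choice here (softmax over negative squared feature distances) and the temperature are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def nw_head(query_feat, support_feats, support_labels, num_classes, temperature=1.0):
    """Sketch of a Nadaraya-Watson head: predict a class distribution for a
    query as a similarity-weighted average of support labels.
    Kernel and temperature are illustrative assumptions."""
    # Similarity score: negative squared Euclidean distance in feature space
    dists = ((support_feats - query_feat) ** 2).sum(axis=1)
    sims = -dists / temperature
    # Softmax over support elements -- these are the "label weights"
    w = np.exp(sims - sims.max())
    w /= w.sum()
    # One-hot encode the support labels
    onehot = np.eye(num_classes)[support_labels]
    # Weighted average of one-hots is a valid probability distribution
    return w @ onehot

# Toy usage with random features and a support set of 10 elements
rng = np.random.default_rng(0)
support_feats = rng.normal(size=(10, 4))
support_labels = rng.integers(0, 3, size=10)
probs = nw_head(rng.normal(size=4), support_feats, support_labels, num_classes=3)
```

Because each output probability is a convex combination of support labels, inspecting the weights `w` directly shows which support elements drove the prediction, which is the basis for the interpretability properties discussed in the paper.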
Researcher Affiliation Academia Alan Q. Wang EMAIL School of Electrical and Computer Engineering Cornell University Mert R. Sabuncu EMAIL School of Electrical and Computer Engineering Cornell University
Pseudocode No The paper describes methods and derivations using mathematical equations (e.g., Eq. 3, 4, 5, 6, 7, 8) but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code Yes Our code is available at https://github.com/alanqrwang/nwhead.
Open Datasets Yes Datasets. We experiment with an array of computer vision datasets with different class diversity and training set size. For general image classification, we experiment with Cifar-100 (Krizhevsky, 2009). For fine-grained image classification with small to medium training set size (less than 12k training samples), we experiment with CUB-200-2011 (Bird-200) (Wah et al., 2011), Stanford Dogs (Dog-120) (Khosla et al., 2011), Oxford Flowers (Flower-102) (Nilsback & Zisserman, 2008), and FGVC-Aircraft (Aircraft-100) (Maji et al., 2013). For a large-scale fine-grained image classification task (500k training samples), we experiment with iNaturalist-10k (Van Horn, 2021). Additional dataset details are provided in Appendix A.1.
Dataset Splits Yes For Cifar-100, Flower-102, Aircraft-100, and iNaturalist-10k, we use the implementation in the torchvision package. For iNaturalist-10k, we use the "mini" training dataset from the 2021 competition, which has balanced images per class, and test on the validation set (the test set is not provided). For Bird-200 and Dog-120, we pull the train/test splits from the dataset websites.
Hardware Specification Yes All training and inference is done on an Nvidia A6000 GPU and all code is written in PyTorch.
Software Dependencies No The paper names PyTorch as the framework used ("All training and inference is done on an Nvidia A6000 GPU and all code is written in PyTorch."), but it does not specify a version number for PyTorch or for any other software dependency.
Experiment Setup Yes For NW training, we use SGD with momentum 0.9, weight decay 1e-4, and an initial learning rate of 1e-3. The learning rate is divided by 10 after 20K and 30K gradient steps, and training stops after 40K gradient steps. For all datasets except iNaturalist-10k, we set the randomly sampled support size to Ns = 10, and the mini-batch size to Nb = 32 for Cifar-100 and Nb = 4 for Bird-200, Dog-120, Flower-102, and Aircraft-100. That is, we sample a unique support set for each query, following Eq. 6. For iNaturalist-10k, we use a mini-batch size Nb = 128 and support size Ns = 250. We sample a single support set for each mini-batch (instead of each query) for computational efficiency. Effectively, this makes the total number of images per mini-batch Ns + Nb instead of Ns × Nb. Note that this change has very little effect on training because the number of unique pairwise similarities is preserved.
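The step-decay schedule quoted above (initial LR 1e-3, divided by 10 after 20K and 30K gradient steps, stopping at 40K) can be written out explicitly. The function below is a small sketch of that schedule; the function name and structure are mine, mirroring the behavior of a multi-step decay such as PyTorch's MultiStepLR.

```python
def lr_at_step(step, base_lr=1e-3, milestones=(20_000, 30_000), gamma=0.1):
    """Learning rate under the paper's reported schedule: start at 1e-3
    and divide by 10 after 20K and 30K gradient steps.
    (Sketch only; names and structure are assumptions.)"""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma  # decay by a factor of 10 at each milestone
    return lr

# The schedule over the full 40K-step training run:
early = lr_at_step(10_000)   # before any milestone
mid = lr_at_step(25_000)     # after the first decay
late = lr_at_step(39_999)    # after the second decay, just before stopping
```

This piecewise-constant decay is a common choice for SGD with momentum; the concrete milestones and factor here come directly from the quoted setup.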