Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CHiQPM: Calibrated Hierarchical Interpretable Image Classification

Authors: Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Neslihan Kose, Ramesh Manuvinakurike, Bodo Rosenhahn

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments This section discusses the main quantitative results of our proposed method. Further qualitative examples are included in the appendix. Figures 14 to 20 showcase global explanations, Figures 21 to 30 include the novel local hierarchical explanations and Figures 31 to 34 demonstrate how the features of CHiQPM are general concept detectors. The accuracy as point predictor along the generally preferable qualities of Compactness, Contrastiveness and Structural Grounding is shown in Table 1. CHiQPM shows state-of-the-art accuracy for compact point predictors. Further, it scores nearly perfectly on Contrastiveness. CHiQPM learns features that can be more clearly separated between active and inactive than even the class detectors of PIP-Net [38], indicating a gap between the ReLU-induced minimum of 0 and the activations where a relevant concept is found. The clear distinction between active and inactive enables our saliency maps, like in Figures 1 and 2, to also transport activation rather than just location without a reference test image and therefore enables extensive local explanations in practice. The details are explained in Appendix C.2. Finally, Structural Grounding quantifies that the additionally added pairs via Equation (31) are also similar in reality and thus lead to more grounded class representations. The state-of-the-art accuracy as point predictor paves the way for accurate set prediction along the hierarchical explanation, as the sets are conditioned on the predicted class. For calibrating Conformal Prediction methods, the first 10 test examples per class are split off into the calibration data Dcal. That way, we can use the same models for evaluating point and set prediction. Notably, applying Split Conformal Predictions requires exchangeability between calibration and test data. Our experimental setup is designed to ensure this exchangeability, as detailed in Appendix L. As comparable CP methods, THR [50] and APS [46] are used, as they are applicable without hyperparameters and broadly used [7]. Table 2 compares our built-in CP method with these and also with the two simpler nonconformity scores ssel and sup. Evidently, our proposed nonconformity score that restricts the sets to be constructed by going up the hierarchical local explanations shows competitive efficiency to THR for higher error rate α and approaches APS for lower values.
Researcher Affiliation	Collaboration	Thomas Norrenbrock, Timo Kaiser & Bodo Rosenhahn Institute for Information Processing (tnt) L3S Leibniz Universität Hannover, Germany EMAIL Sovan Biswas & Neslihan Kose Intel Labs, Germany EMAIL Ramesh Manuvinakurike Intel Labs, USA EMAIL
Pseudocode	No	The paper describes methods and processes (e.g., in Section 3, Figure 4 showing an overview of the pipeline) but does not present them in formal pseudocode or algorithm blocks. The pipeline in Figure 4 is a flowchart, not pseudocode.
Open Source Code	Yes	Our main contributions¹ are: We present the Calibrated Hierarchical QPM (CHiQPM). It is based on a heavily constrained discrete quadratic problem (QP), that selects features from a black-box model and assigns them to classes. The features of CHiQPM then adapt to the optimal solution, resulting in a globally and locally interpretable model. CHiQPM offers novel hierarchical local explanations and can be calibrated to reach a target coverage with competitive efficiency while ascending through its dynamically constructed interpretable class hierarchy and selecting the appropriate level. Thus, CHiQPM can be considered an interpretable conformal predictor. We present the Feature Grounding Loss Lfeat, which, alongside an additional ReLU, leads to learning more grounded and sparser features that facilitate compact hierarchical explanations along more human concepts. The state-of-the-art performance of CHiQPM as point- and built-in interpretable calibrated coherent set-predictor is evaluated across multiple architectures and datasets, including ImageNet-1K [48], where the gap to the black-box baseline is more than halved. ¹The code is published: https://github.com/ThomasNorr/CHiQPM/.
Open Datasets	Yes	Following QPM (Section 2.2), we evaluate our method on CUB-2011, Stanford Cars [26] and ImageNet-1K. CUB-2011 and Stanford Cars are the most commonly used datasets for interpretability, while ImageNet-1K is suitable to demonstrate how the method scales to larger problems with more real-world applications. CUB-2011 includes human annotations of relevant concepts for every image, which makes it suitable for evaluating the alignment between human representations and the ones learned.
Dataset Splits	Yes	For calibrating Conformal Prediction methods, the first 10 test examples per class are split off into the calibration data Dcal. That way, we can use the same models for evaluating point and set prediction. Notably, applying Split Conformal Predictions requires exchangeability between calibration and test data. Our experimental setup is designed to ensure this exchangeability, as detailed in Appendix L.
Hardware Specification	Yes	As GPU resource, this work made use of an internal cluster composed of several NVIDIA RTX 2080 Ti. Every experiment fit on one GPU. As CPU resource for solving the QP, this paper used an internal CPU cluster composed of primarily AMD EPYC 72F3 and up to 250 GB of RAM.
Software Dependencies	No	For implementing CP, we utilized the torchcp [64] package and implemented the model using PyTorch [44]. Note that all details will also be clear in the published code. The QP is solved as described, with the additional constraints from Section 3.1, but with two relaxations based on observations: First, the MIP-Gap of the discrete optimization in Gurobi [17] can be relaxed without significant effect on the resulting metrics, hence we set it to 1%.
Experiment Setup	Yes	As usual in literature, nwc = 5 and nf = 50 are set if not reported otherwise. Further, we generally set the density parameter for our class hierarchy to ρ = 0.5, as it is sufficient to demonstrate the improvements in built-in set prediction without sacrificing accuracy as point-predictor. Finally, λfeat = 3 is set as higher values cause reduced accuracy.
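
The Dataset Splits entry above describes splitting off the first 10 test examples per class as calibration data for Split Conformal Prediction. The following is a minimal sketch of that idea, not the authors' implementation (which, per the quoted text, uses torchcp). The function names split_calibration, thr_threshold and predict_sets are hypothetical, and the score shown is the common THR-style score (one minus the probability of the true class), assumed here for illustration only.

```python
import math
import numpy as np

def split_calibration(labels, n_cal_per_class=10):
    """Put the first n_cal_per_class test examples of every class into the
    calibration set; everything else stays in the evaluation set."""
    labels = np.asarray(labels)
    cal_mask = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)          # positions of class c, in order
        cal_mask[idx[:n_cal_per_class]] = True
    return np.flatnonzero(cal_mask), np.flatnonzero(~cal_mask)

def thr_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold for the score 1 - p(true class).
    Returns the k-th smallest calibration score, k = ceil((n + 1) * (1 - alpha))."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:                                      # too few calibration points:
        return float("inf")                        # every class must be included
    return float(scores[k - 1])

def predict_sets(probs, qhat):
    """Boolean matrix: entry (i, c) is True if class c is in the set of example i."""
    return (1.0 - np.asarray(probs, dtype=float)) <= qhat

# Toy usage: 2 classes, 3 test examples each, 2 per class used for calibration.
labels = [0, 0, 0, 1, 1, 1]
cal_idx, eval_idx = split_calibration(labels, n_cal_per_class=2)
cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
qhat = thr_threshold(cal_probs, [0, 0, 1, 1], alpha=0.5)
sets = predict_sets([[0.75, 0.25]], qhat)
```

The coverage guarantee of this procedure relies on calibration and test examples being exchangeable, which is the requirement the quoted excerpt refers to.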