Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Bayesian Concept Bottleneck Models with LLM Priors

Authors: Jean Feng, Avni Kothari, Lucas Zier, Chandan Singh, Yan Shuo Tan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across image, text, and tabular datasets, BC-LLM outperforms interpretable baselines and even black-box models in certain settings, converges more rapidly towards relevant concepts, and is more robust to out-of-distribution samples.
Researcher Affiliation	Collaboration	Jean Feng, Avni Kothari, Luke Zier (University of California, San Francisco); Chandan Singh (Microsoft Research); Yan Shuo Tan (National University of Singapore)
Pseudocode	Yes	Algorithm 1 Metropolis-within-Gibbs. 1: Initialize concept set $c = (c_1, \ldots, c_K)$ 2: List of concept sets $L = []$ 3: for $t = 1, 2, \ldots, T$ do 4: for $k = 1, 2, \ldots, K$ do 5: $c_k \leftarrow \text{MH-UPDATE}(c, k)$ // Update k-th concept 6: Append the current $c$ to $L$ 7: return $L$ 8: function MH-UPDATE($c$, $k$) 9: Propose concept $\check{c}_k \sim Q(C_k; c_{-k})$ 10: $\alpha \leftarrow \min\left\{ \frac{p((c_{-k}, \check{c}_k) \mid y, X)\, Q(c_k; c_{-k})}{p(c \mid y, X)\, Q(\check{c}_k; c_{-k})}, 1 \right\}$ 11: if Accept with probability $\alpha$ then return $\check{c}_k$ 12: else return $c_k$
Open Source Code	Yes	Code for running BC-LLM and reproducing results in the paper are available at https://github.com/jjfeng/bc-llm.
Open Datasets	Yes	We evaluated BC-LLM across three domains and modalities: classifying birds in images (Section 4.1), simulated outcomes from clinical notes (Section 4.2), and readmission risk in real-world clinical data (Section 4.3). The experiments below used GPT-4o-mini [57] and Section 4.3 used a protected health information-compliant version of GPT-4o for real-world clinical notes.
Dataset Splits	Yes	The data is split 50/50 between training and testing. CBMs with K=6 were trained on 100 to 800 observations. CBMs were trained on 1000 patients and evaluated on 500 held-out patients.
Hardware Specification	No	The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory used for running the experiments. It discusses computational cost in terms of LLM queries and time taken per iteration, but not the underlying hardware.
Software Dependencies	No	The paper mentions using specific LLM models like 'GPT-4o-mini [57]' and 'GPT-4o' but does not provide details on other ancillary software dependencies, such as programming languages (e.g., Python), libraries (e.g., PyTorch, scikit-learn), or their specific version numbers.
Experiment Setup	Yes	The number of concepts K learned for each task was set to the number of classes, but no smaller than 4 and no greater than 10. For the black-box comparator, we fine-tuned the last layer of ResNet50 pre-trained on ImageNet V2 [60, 61]. Human+CBM was trained on the 312 human-annotated features available in the CUB-birds dataset. Fraction ω of data used for partial posterior: Choosing a small ω may lead to the LLM proposing less relevant concepts, but tends to lead to more diverse proposals. In contrast, a large ω tends to lead to less diverse proposals because the LLM is encouraged to propose concepts that are relevant to the dataset D, which may not necessarily generalize. In experiments, we found that ω = 0.5 provided good results. Number of candidate concepts M: ... We found that setting M = 10 provided good performance. Warm-start and Burn-in: Since Gibbs sampling can be slow to converge, we precede it with a warm-start, which we obtain by updating concepts greedily. That is, we select the concept that maximizes $p(\gamma \mid c_{-k}, y_S, X)$, instead of sampling from the distribution. In experiments, we run warm-start for one epoch and stored the last 20 iterates as posterior samples; the rest of the samples were treated as burn-in. Number of iterations T: ... our experiments all use T = 5.
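
Illustrative sketch (not the authors' implementation): the Pseudocode and Experiment Setup rows describe a Metropolis-within-Gibbs sampler over concept sets, a rule for choosing K, and a burn-in that keeps the last 20 iterates. The Python below is a minimal sketch of that loop under stated assumptions. The concept vocabulary, the relevance weights, the uniform proposal, and every function name are hypothetical stand-ins; the paper proposes concepts with an LLM and scores them against data. Recording one iterate per concept update is also an assumption, since the flattened pseudocode does not show which loop the append belongs to.

```python
import math
import random


def num_concepts(num_classes, lo=4, hi=10):
    """K equals the number of classes, clamped to the range [lo, hi]."""
    return max(lo, min(hi, num_classes))


def mh_update(c, k, propose, proposal_prob, log_post, rng):
    """One Metropolis-Hastings update of concept k (steps 8-12 of Algorithm 1)."""
    candidate = propose(c, k, rng)
    proposed = c[:k] + [candidate] + c[k + 1:]
    # log of p(proposed | y, X) Q(c_k; c_-k) / (p(c | y, X) Q(candidate; c_-k))
    log_ratio = (
        log_post(proposed) + math.log(proposal_prob(c[k], c, k))
        - log_post(c) - math.log(proposal_prob(candidate, c, k))
    )
    alpha = math.exp(min(0.0, log_ratio))  # acceptance probability min{ratio, 1}
    return candidate if rng.random() < alpha else c[k]


def metropolis_within_gibbs(c_init, T, propose, proposal_prob, log_post, rng):
    """Sweep over the K concepts T times, recording the set after each update."""
    c = list(c_init)
    samples = []
    for _ in range(T):
        for k in range(len(c)):
            c[k] = mh_update(c, k, propose, proposal_prob, log_post, rng)
            samples.append(list(c))  # assumption: one iterate per concept update
    return samples


# Hypothetical toy stand-ins: a fixed vocabulary with made-up relevance weights.
WEIGHTS = {"wing color": 5.0, "beak shape": 3.0, "tail length": 1.0, "background": 0.2}
VOCAB = sorted(WEIGHTS)


def toy_propose(c, k, rng):
    return rng.choice(VOCAB)  # uniform proposal; the paper uses LLM proposals


def toy_proposal_prob(concept, c, k):
    return 1.0 / len(VOCAB)  # probability of any concept under the uniform proposal


def toy_log_post(c):
    return sum(math.log(WEIGHTS[name]) for name in c)  # unnormalized log posterior


rng = random.Random(0)
K = num_concepts(num_classes=6)  # 6 classes lies within [4, 10], so K = 6
samples = metropolis_within_gibbs(["background"] * K, 5, toy_propose,
                                  toy_proposal_prob, toy_log_post, rng)
posterior_samples = samples[-20:]  # keep the last 20 iterates; the rest is burn-in
```

With T = 5 and K = 6 the sketch produces 30 iterates, of which the last 20 are kept. The greedy warm-start described in the Experiment Setup row is not modeled here.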