Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Interactive Concept Bottleneck Models

Authors: Kushal Chauhan, Rishabh Tiwari, Jan Freyberg, Pradeep Shenoy, Krishnamurthy Dvijotham

AAAI 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that a simple policy combining concept prediction uncertainty and influence of the concept on the final prediction achieves strong performance and outperforms static approaches as well as active feature acquisition methods proposed in the literature. We show that the interactive CBM can achieve accuracy gains of 5-10% with only 5 interactions over competitive baselines on the Caltech-UCSD Birds, CheXpert and OAI datasets. 4 Experiments
Researcher Affiliation	Industry	Kushal Chauhan, Rishabh Tiwari, Jan Freyberg, Pradeep Shenoy, Krishnamurthy Dvijotham Google Research EMAIL
Pseudocode	Yes	Algorithm 1: Policy Rollout
Open Source Code	Yes	Code is available at https://github.com/google-research/google-research/tree/master/interactive_cbms
Open Datasets	Yes	CUB (Caltech-UCSD Birds): This dataset contains pictures of birds coupled with human-labeled concept attributes identifying prominent characteristics (wing color, beak length, undertail color, etc.) (Wah et al. 2011). CHEXPERT: This dataset contains chest X-rays accompanied by binary concept labels extracted from a report generated by a radiologist, with the goal of predicting whether the X-ray was normal or abnormal (Irvin et al. 2019).
Dataset Splits	Yes	For each experiment, we split the data into 3 sets: train, validation, and test; the details are available in Table 1. ... Table 1: Details of the datasets used in our experiments. Data splits: train 4,796; val 1,198; test 5,794
Hardware Specification	No	No specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments were mentioned in the paper.
Software Dependencies	No	No specific software dependencies with version numbers (e.g., library or framework versions) were explicitly stated in the paper.
Experiment Setup	No	The paper describes the general training and evaluation process, including dataset splits and the types of CBMs used, but does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) in the main text.