Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CONDA: Adaptive Concept Bottleneck for Foundation Models Under Distribution Shifts

Authors: Jihye Choi, Jayaram Raghuram, Yixuan Li, Somesh Jha

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations with various real-world distribution shifts show our framework produces concept-based interpretations better aligned with the test data and boosts post-deployment accuracy by up to 28%... 4 EXPERIMENTS: In this section, we conduct experiments to answer the following three research questions: RQ1: How effective is CONDA in improving the test-time performance of deployed classification pipelines...? Metrics. We report the performance in terms of two metrics: averaged group accuracy (AVG) and worst-group accuracy (WG). Table 1 presents our main results evaluating the effectiveness of CONDA on different real-world distribution shifts...
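The AVG and WG metrics quoted above have a standard definition: accuracy is computed per group, then averaged (AVG) or minimized (WG) across groups. A minimal sketch follows; the function name and toy data are illustrative, not taken from the paper.

```python
import numpy as np

def group_accuracies(preds, labels, groups):
    """Per-group accuracy, reduced to the averaged-group (AVG)
    and worst-group (WG) metrics described in the excerpt."""
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((preds[mask] == labels[mask]).mean())
    return float(np.mean(accs)), float(np.min(accs))

# toy example with two groups
preds  = np.array([0, 1, 1, 0, 1, 1])
labels = np.array([0, 1, 0, 0, 1, 1])
groups = np.array([0, 0, 0, 1, 1, 1])
avg_acc, wg_acc = group_accuracies(preds, labels, groups)
```

WG is the more pessimistic metric: it reports only the hardest group, so it can drop sharply under distribution shift even when AVG looks healthy.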
Researcher Affiliation | Academia | Jihye Choi, Jayaram Raghuram, Yixuan Li & Somesh Jha, Department of Computer Sciences, University of Wisconsin-Madison
Pseudocode | Yes | Algorithm 1 CONDA: CONCEPT-BASED DYNAMIC ADAPTATION. Inputs: Foundation model φ(x). Source domain CBM: C_s, W_s, b_s. Concept scores distribution statistics: {(µ_y, Σ_y)}_{y∈Y}. Unlabeled test dataset D_t.
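The algorithm inputs describe a linear concept bottleneck on top of a frozen foundation model: features are projected onto concept vectors C to get concept scores, and a linear head (W, b) maps scores to class logits. This is a minimal sketch of that forward pass only; all names, shapes, and the random data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cbm_predict(features, C, W, b):
    """Concept-bottleneck prediction: project foundation-model
    features onto concept vectors, then apply a linear head."""
    scores = features @ C.T   # (n, num_concepts) concept scores
    logits = scores @ W.T + b  # (n, num_classes)
    return logits.argmax(axis=1)

# toy dimensions: 4-d features, 3 concepts, 2 classes
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))
C = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 3))
b = np.zeros(2)
preds = cbm_predict(features, C, W, b)
```

The per-class concept-score statistics {(µ_y, Σ_y)} listed as inputs are what the adaptation step compares against on the unlabeled test data; they are not used in this bare prediction sketch.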
Open Source Code | Yes | The code repository for our work is available at https://github.com/jihyechoi77/CONDA.
Open Datasets | Yes | Datasets. We evaluate the performance of concept bottlenecks for FMs and the proposed adaptation on five real-world datasets with distribution shifts, following the setup in Lee et al. (2023): (1) CIFAR10 to CIFAR10-C and CIFAR100 to CIFAR100-C for low-level shift, (2) Waterbirds and Metashift for concept-level shift, and (3) Camelyon17 for natural shift. CIFAR10-C (Hendrycks & Dietterich, 2019); Waterbirds, Metashift (Sagawa et al., 2019; Liang & Zou, 2021).
Dataset Splits | Yes | CIFAR10. It consists of 60k RGB images of size 32x32 (50k images for the train set, and 10k images for the test set). Metashift. For evaluation, we randomly split 90:10 equally across the correlation types. Camelyon17. We use the train set (hospitals 1-3) for source, and the test set (hospital 5) for the target.
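The 90:10 Metashift evaluation split can be sketched as a seeded random partition; the function below is illustrative only, and applying it "equally across the correlation types" would mean calling it once per correlation type, which is an assumption about the paper's procedure.

```python
import random

def split_90_10(items, seed=0):
    """Seeded random 90:10 split of a collection of examples."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed for reproducibility
    rng.shuffle(items)
    cut = int(0.9 * len(items))
    return items[:cut], items[cut:]

train, test = split_90_10(range(100))
```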
Hardware Specification | Yes | All the experiments are run on a server with thirty-two AMD EPYC 7313P 16-core processors, 528 GB of memory, and four Nvidia A100 GPUs. Each GPU has 80 GB of memory.
Software Dependencies | No | The paper mentions using optimizers like Adam and SGD in Table 4, but does not specify version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software tools critical for reproduction.
Experiment Setup | Yes | Table 4: Summary of the hyper-parameters used in our experiments. Dataset: CIFAR10 | Backbone: CLIP:ViT-L-14 (FARE2) | Batch Size: 128 | # Epochs: 50 | lr (CSA, LPA, RCB): Adam, 0.01 | Adaptation steps: 20 | {λ_frob, λ_sparse, λ_sim, λ_coh}: {0.1, 1.0, 0.1, 2.0}
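The CIFAR10 row of Table 4 can be transcribed into a configuration mapping for reproduction attempts. The key names below are illustrative choices, not identifiers from the authors' code; the values are copied from the table.

```python
# Hyper-parameters for the CIFAR10 experiment, transcribed from
# Table 4 of the paper (key names are illustrative assumptions).
cifar10_config = {
    "backbone": "CLIP:ViT-L-14 (FARE2)",
    "batch_size": 128,
    "epochs": 50,
    "optimizer": "Adam",
    "lr": 0.01,               # shared across CSA, LPA, RCB
    "adaptation_steps": 20,
    "lambda_frob": 0.1,       # regularization weights
    "lambda_sparse": 1.0,
    "lambda_sim": 0.1,
    "lambda_coh": 2.0,
}
```

Because the paper does not pin library versions (see the Software Dependencies row), any rerun built from this table should record its own environment alongside these values.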