Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer
Authors: Hangzhou He, Lei Zhu, Xinliang Zhang, Shuang Zeng, Qian Chen, Yanye Lu
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our V2C-CBM has matched or outperformed LLM-supervised CBMs on various visual classification benchmarks, validating the efficacy of our approach. Our contributions can be summarized as follows. 1. We propose the V2C tokenizer to discover visual concepts directly from images, avoiding the use of LLMs. 2. We adopt common words as our concept vocabulary and develop a concept filtering method to remove non-visual and irrelevant concepts using auxiliary unlabeled images. 3. The V2C-CBM that is built on the vision-oriented concepts generated by our V2C tokenizer can achieve high classification accuracy across various datasets with visually interpretable concepts. Experimental Setup We choose the following datasets for evaluation: CIFAR10, CIFAR100 (Krizhevsky and Hinton 2009), ImageNet (Russakovsky et al. 2015) as the standard benchmarks for image classification; Aircraft (Maji et al. 2013), CUB (Wah et al. 2011), Flower (Nilsback and Zisserman 2008), and Food101 (Bossard, Guillaumin, and Gool 2014) for fine-grained image classification; DTD (Cimpoi et al. 2014) for texture classification; RESISC45 (Cheng, Han, and Lu 2017) for remote sensing scene classification; and HAM10000 (Tschandl, Rosendahl, and Kittler 2018) for skin tumor classification. We also use the same few-shot images and settings as LaBo and CLIP for a fair comparison. The classification accuracy on the test set is reported. In Table 1, we present the classification accuracy of our V2C-CBM on ten datasets. Figure 3 illustrates the few-shot performance of our V2C-CBM on 9 datasets. Ablation Study In Table 4, we investigate the impact of the size of the unlabeled image set on the final performance of V2C-CBM. |
| Researcher Affiliation | Academia | Hangzhou He1,2,3,4, Lei Zhu1,2,3,4, Xinliang Zhang1,2,3,4, Shuang Zeng1,2,3,4, Qian Chen1,2,3,4, Yanye Lu1,2,3,4* 1Institute of Medical Technology, Peking University, Beijing, China 2Department of Biomedical Engineering, Peking University, Beijing, China 3National Biomedical Imaging Center, Peking University, Beijing, China 4Institute of Biomedical Engineering, Peking University Shenzhen Graduate School, Shenzhen, China EMAIL, zhangxinliang EMAIL, EMAIL |
| Pseudocode | No | The paper describes the method using a diagram (Figure 2) and detailed textual explanations of the steps, but does not include a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | Code https://github.com/riverback/V2C-CBM |
| Open Datasets | Yes | We choose the following datasets for evaluation: CIFAR10, CIFAR100 (Krizhevsky and Hinton 2009), ImageNet (Russakovsky et al. 2015) as the standard benchmarks for image classification; Aircraft (Maji et al. 2013), CUB (Wah et al. 2011), Flower (Nilsback and Zisserman 2008), and Food101 (Bossard, Guillaumin, and Gool 2014) for fine-grained image classification; DTD (Cimpoi et al. 2014) for texture classification; RESISC45 (Cheng, Han, and Lu 2017) for remote sensing scene classification; and HAM10000 (Tschandl, Rosendahl, and Kittler 2018) for skin tumor classification. |
| Dataset Splits | Yes | We also use the same few-shot images and settings as LaBo and CLIP for a fair comparison. The classification accuracy on the test set is reported. We randomly sample images from the ImageNet training set, and the default number of the unlabeled images is 200k. Figure 3 illustrates the few-shot performance of our V2C-CBM on 9 datasets. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA A100 80G PCIe graphics card using PyTorch. |
| Software Dependencies | No | All experiments are conducted on an NVIDIA A100 80G PCIe graphics card using PyTorch. We use CLIP ViT-L/14 to build our V2C tokenizer and V2C-CBM. For concept vocabulary, we use the English word frequency described in (Norvig 2009), and use the NLTK library (Xue 2011) to determine adjectives and nouns to build the concept vocabulary. Adam (Kingma and Ba 2015) is used for optimization. |
| Experiment Setup | Yes | NC is set to 50 for all datasets. For each image, we select the top K (set to 5) concepts to update frequency. We then rank the word frequencies and select the top M (set to 500) words. Adam (Kingma and Ba 2015) is used for optimization, and the detailed hyperparameters are provided in the supplementary material. We use class names to extract the base features Fbase for each class. |
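The vocabulary-construction step quoted above (for each unlabeled image, select the top K = 5 most similar concepts to update word frequencies, then rank frequencies and keep the top M = 500 words) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_concept_vocabulary` and the toy similarity scores are assumptions, standing in for CLIP image-text similarities computed over the 200k unlabeled ImageNet images.

```python
# Hedged sketch of the top-K / top-M concept-vocabulary filtering described
# in the paper's setup. The paper uses K=5 and M=500 with CLIP similarities;
# the tiny hand-written scores below are illustrative stand-ins.
from collections import Counter

def build_concept_vocabulary(image_word_sims, k=5, m=500):
    """image_word_sims: one {word: similarity} dict per unlabeled image."""
    freq = Counter()
    for sims in image_word_sims:
        # Top-K words for this image by similarity update the frequency count.
        top_k = sorted(sims, key=sims.get, reverse=True)[:k]
        freq.update(top_k)
    # Rank accumulated frequencies and keep the top-M words as the vocabulary.
    return [word for word, _ in freq.most_common(m)]

# Toy example: three "images" scored against a four-word candidate list.
sims = [
    {"striped": 0.9, "red": 0.2, "furry": 0.7, "metallic": 0.1},
    {"striped": 0.8, "red": 0.6, "furry": 0.3, "metallic": 0.2},
    {"striped": 0.4, "red": 0.9, "furry": 0.8, "metallic": 0.3},
]
vocab = build_concept_vocabulary(sims, k=2, m=3)
```

Because frequencies are accumulated over many unlabeled images before the top-M cut, words that score highly only for a handful of images (here, "metallic") drop out of the final vocabulary.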