Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Authors: Xudong Yan, Songhe Feng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT. 4 Experiment. 4.1 Experiment Setup. Datasets. Our proposed TOMCAT is evaluated on four commonly used datasets: UT-Zappos [61], MIT-States [15], C-GQA [40], and Clothing16K [68]. Metrics. Following the evaluation protocol of previous works [46, 35, 42], a bias term from −∞ to +∞ is introduced to trade off the prediction logits between seen and unseen compositions. By varying the bias term, we calculate the best Seen accuracy, best Unseen accuracy, best Harmonic Mean (HM) of seen and unseen accuracies [70], and the Area Under the Curve (AUC) drawn with seen and unseen accuracies. In the open-world setting, a post-training feasibility calibration is applied to filter out infeasible compositions within a vast search space [35]. Implementation Details. We implement the base model with CLIP ViT-L/14 architecture in the training phase and TOMCAT at test time in PyTorch [45] framework on a single NVIDIA RTX 3090 GPU. Refer to Appendix D for more implementation details. The source code will also be released at this website to provide all implementation details and thus facilitate reproducibility. Baselines. We compare TOMCAT with recent and prominent approaches on UT-Zappos [61], MIT-States [15], and C-GQA [40], including CLIP [49], CoOp [73], CLIP-based Co-CGE [36], DFSP [34], ... Table 1: Closed-world and open-world results on UT-Zappos, MIT-States, and C-GQA. The best results are displayed in boldface, and the second-best results are underlined. The four indicators are explained in Metrics (Sec. 4.1). In the open-world setting, we report the results of CDS-CZSL [28] using the same post-training feasibility calibration [35] as our TOMCAT and other baselines use. 4.3 Ablation Study. In this section, we conduct extensive ablation studies to evaluate the contribution of each component within our proposed TOMCAT on UT-Zappos, MIT-States, and C-GQA in the closed-world setting. Ablation Study of Main Modules. According to Table 3, for MIT-States, empirical results indicate that all proposed modules contribute to the performance improvement of TOMCAT, including the priority queue, multimodal KAMs, and the adaptive update weights. Ablation Study of Each Loss. As shown in Table 4, the improvement of adding L_PE to the base model indicates that prediction entropy loss helps the model to adapt towards the new label distribution of seen and unseen compositions. Influence of Different Initialization Strategies of Multimodal KAMs. Table 5 shows that the zero-initialization outperforms other random initialization strategies. Influence of Test Order. Since TOMCAT continually accumulates knowledge during testing, the test order may affect the results. We conduct three experiments with different random seeds to vary the sample order. The results in Table 6 suggest that there exists performance variance among different orders, although the differences are not statistically significant. Influence of Hyperparameters. In Fig. 3 and Fig. 4, we explore the influence of the number of images stored in priority queue K and the update control factor θ respectively on UT-Zappos and MIT-States.
Researcher Affiliation	Academia	Xudong Yan (1,2), Songhe Feng (1,2). (1) School of Computer Science and Technology, Beijing Jiaotong University; (2) Key Laboratory of Big Data and Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education.
Pseudocode	No	The paper describes its methodology in natural language and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	Code will be available at https://github.com/xud-yan/TOMCAT. The source code will also be released at this website to provide all implementation details and thus facilitate reproducibility. The source code of our method will be released at https://github.com/xud-yan/TOMCAT.
Open Datasets	Yes	Our proposed TOMCAT is evaluated on four commonly used datasets: UT-Zappos [61], MIT-States [15], C-GQA [40], and Clothing16K [68]. Datasets. UT-Zappos [61] and MIT-States [15] are two early-proposed datasets commonly used in CZSL. Their creators and owners have not declared any license and have allowed non-commercial research use. C-GQA [40] is under CC BY 4.0 license. Clothing16K [68] is under CC0 license.
Dataset Splits	Yes	The detailed introduction and common data splits of the four datasets are presented in Appendix C. Table 7: Summary statistics of the four datasets used in our experiments. UT-Zappos [61]: composition |A|=16, |O|=12, |A×O|=192; train |Cs|=83, |X|=22998; validation |Cs|=15, |Cu|=15, |X|=3214; test |Cs|=18, |Cu|=18, |X|=2914. MIT-States [15]: composition |A|=115, |O|=245, |A×O|=28175; train |Cs|=1262, |X|=30338; validation |Cs|=300, |Cu|=300, |X|=10420; test |Cs|=400, |Cu|=400, |X|=12995. C-GQA [40]: composition |A|=413, |O|=674, |A×O|=278362; train |Cs|=5592, |X|=26920; validation |Cs|=1252, |Cu|=1040, |X|=7280; test |Cs|=888, |Cu|=923, |X|=5098. Clothing16K [68]: composition |A|=9, |O|=8, |A×O|=72; train |Cs|=18, |X|=7242; validation |Cs|=10, |Cu|=10, |X|=5515; test |Cs|=9, |Cu|=8, |X|=3413.
Hardware Specification	Yes	We implement the base model with CLIP ViT-L/14 architecture in the training phase and TOMCAT at test time in PyTorch [45] framework on a single NVIDIA RTX 3090 GPU.
Software Dependencies	No	We implement the base model with CLIP ViT-L/14 architecture in the training phase and TOMCAT at test time in PyTorch [45] framework on a single NVIDIA RTX 3090 GPU. The paper mentions 'PyTorch [45]' but does not specify a version number for PyTorch or other software dependencies.
Experiment Setup	Yes	Implementation Details. We implement the base model with CLIP ViT-L/14 architecture in the training phase and TOMCAT at test time in PyTorch [45] framework on a single NVIDIA RTX 3090 GPU. Refer to Appendix D for more implementation details. The source code will also be released at this website to provide all implementation details and thus facilitate reproducibility. Table 8: Hyperparameter settings for UT-Zappos, MIT-States, C-GQA, and Clothing16K (values listed in that order). The Base Model (Training Phase): Batch Size 128 / 64 / 16 / 128; Epochs 20 / 20 / 20 / 20; Prompt Dropout Rate 0.3 / 0.3 / 0 / 0.3; Adapter Downsampling Dimension 64 / 64 / 64 / 64; Adapter Dropout 0.1 / 0.1 / 0.1 / 0.1; Optimizer Adam / Adam / Adam / Adam; Optimizer-Weight Decay 1e-5 / 1e-4 / 1e-5 / 1e-5; Optimizer-Learning Rate 5e-4 / 1e-4 / 1e-4 / 5e-4; Scheduler StepLR / StepLR / StepLR / StepLR; Scheduler-Step Size 5 / 5 / 5 / 5; Scheduler-Gamma 0.5 / 0.5 / 0.5 / 0.5. TOMCAT (Test Phase): Batch Size 1 / 1 / 1 / 1; Image Number of Priority Queue 3 / 3 / 3 / 3; Optimizer AdamW / AdamW / AdamW / AdamW; Optimizer-Epsilon 1e-3 / 1e-3 / 1e-3 / 1e-3; Optimizer-Weight Decay 1e-3 / 1e-4 / 1e-4 / 1e-3; Optimizer-Learning Rate 5e-6 / 1e-6 / 6.25e-6 / 5e-6; α 0 / 1.25 / 0.5 / 0.25; β 10 / 10 / 7.5 (only three values appear in the extracted text); θ 1 / 1.5 / 2 / 1; λ 3.5 / 2.5 / 1.75 / 3.5.
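
The Metrics passage quoted in the Research Type row describes sweeping a bias term and reading off the best Seen accuracy, best Unseen accuracy, best HM, and AUC. The following is a minimal sketch of that idea, not the authors' evaluation code. It assumes the bias is added to the logits of unseen compositions only, that a finite grid of bias values stands in for the full sweep, and that AUC is the trapezoidal area under the seen-versus-unseen accuracy curve. The function names `sweep_bias` and `summarize` and all toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def sweep_bias(
    scores: Sequence[Sequence[float]],  # scores[i][c]: logit of sample i for composition c
    labels: Sequence[int],              # ground-truth composition index of each sample
    seen_mask: Sequence[bool],          # seen_mask[c] is True if composition c is a seen one
    biases: Sequence[float],            # finite grid of bias values (assumption)
) -> List[Tuple[float, float]]:
    """Return one (seen_accuracy, unseen_accuracy) pair per bias value."""
    points = []
    for b in biases:
        hits = {True: 0, False: 0}
        totals = {True: 0, False: 0}
        for row, y in zip(scores, labels):
            # Assumption: the bias shifts the logits of unseen compositions only.
            shifted = [s if seen_mask[c] else s + b for c, s in enumerate(row)]
            pred = max(range(len(shifted)), key=shifted.__getitem__)
            group = bool(seen_mask[y])  # True: sample of a seen composition
            totals[group] += 1
            hits[group] += int(pred == y)
        seen_acc = hits[True] / totals[True] if totals[True] else 0.0
        unseen_acc = hits[False] / totals[False] if totals[False] else 0.0
        points.append((seen_acc, unseen_acc))
    return points


def summarize(points: Sequence[Tuple[float, float]]) -> Dict[str, float]:
    """Compute best Seen, best Unseen, best HM, and trapezoidal AUC."""
    best_seen = max(s for s, _ in points)
    best_unseen = max(u for _, u in points)
    best_hm = max((2 * s * u / (s + u)) if (s + u) > 0 else 0.0 for s, u in points)
    # Order the points by unseen accuracy (x-axis) before integrating.
    ordered = sorted(points, key=lambda p: (p[1], -p[0]))
    auc = 0.0
    for (s0, u0), (s1, u1) in zip(ordered, ordered[1:]):
        auc += (u1 - u0) * (s0 + s1) / 2.0
    return {"seen": best_seen, "unseen": best_unseen, "hm": best_hm, "auc": auc}


# Toy example: composition 0 is seen, composition 1 is unseen.
toy_scores = [[3.0, 0.0], [1.0, 0.0], [3.0, 0.0], [1.0, 0.0]]
toy_labels = [0, 0, 1, 1]
toy_points = sweep_bias(toy_scores, toy_labels, [True, False], biases=[0.0, 2.0, 4.0])
toy_summary = summarize(toy_points)
```

In this toy case a small bias favors seen compositions and a large bias favors unseen ones, so the sweep traces the trade-off curve from which the four reported indicators are taken. A real evaluation would use a much finer bias grid and follow the exact protocol of the works the paper cites.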