Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Authors: Pei Peng, Ming-Kun Xie, Hang Hao, Tong Jin, Sheng-Jun Huang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. ... We present our method achieves state-of-the-art zero-shot recognition performance across multiple context-sensitive datasets, significantly improving model's reliability under distribution shifts. ... 4 Experiment Settings. We evaluate our method on four widely adopted benchmarks that target context-sensitive distribution shifts: Waterbirds [8], Urban Cars [9], COCO-GB [10], and NICO [11].
Researcher Affiliation	Academia	1 Nanjing University of Aeronautics and Astronautics, Nanjing, China EMAIL
Pseudocode	Yes	Algorithm 1 reflects our inference method combining representation-level counterfactual construction and total direct effect computation. ... Algorithm 1: INFERENCE WITH TDE AND REPRESENTATION-LEVEL COUNTERFACTUAL CALIBRATION
Open Source Code	Yes	The implementation is available at https://github.com/peipeng98.
Open Datasets	Yes	We evaluate our method on four widely adopted benchmarks that target context-sensitive distribution shifts: Waterbirds [8], Urban Cars [9], COCO-GB [10], and NICO [11]. ... External scene datasets (e.g. Places-365 [36]): It samples dataset directly by CLIP image encoder to get fi(z). ... pretrained on the LAION-2B-en dataset [26], which contains over 2.3 billion image text pairs.
Dataset Splits	Yes	Waterbirds [8]. ... It comprises a total of 10,589 images, including 4,795 training images and 5,794 test images... Urban Cars [9]. ... The training set contains 8,000 images, with strong three-way correlations between class, background, and co-occurring object. Both validation and test sets contain 1000 balanced samples where the spurious cues have extra tags which enables fine-grained analysis of shortcut reliance and interaction among multiple biases.
Hardware Specification	No	For instance, on a single GPU, we can generate 100,000 counterfactual embeddings for a batch of 1,000 samples in less than 3 seconds... Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We disclose our experimental setting in Appendix C.2.
Software Dependencies	No	We use four CLIP vision backbones: ViT-B/32, ViT-B/16, ViT-L/14, and ViT-H/14, all publicly available via OpenCLIP and pretrained on the LAION-2B-en dataset [26]... The pretrained model are: laion2b_s34b_b79k for ViT-B/32, laion2b_s34b_b88k for ViT-B/16, laion2b_s32b_b82k for both ViT-L/14 and laion2b_s32b_b79k for ViT-H/14.
Experiment Setup	Yes	Additional Hyper-parameters. During counterfactual synthesis we blend the object and background embeddings with a fusion weight α. We find α selects 0.5-0.7 yields the best trade-off. The coefficient λ in Eq. (19) regulates how much of the learned object context interaction is retained in the final score; values 0.6-0.8 consistently work well. In Eq. (15), we introduce an additional factor λ̂ to control the malignant hallucination term, normally selecting 1 unless stated otherwise. When forming a counterfactual image embedding in Eq. (11) we optionally discard token contributions whose probability is below a threshold, which can be tightened for unknown noisy. It sets around 0.3 usually provides a good balance. Finally, to limit inference overhead, the operation of Eq. (18) are applied only to the top-5 classes returned by the initial softmax; all other classes still receive the TDE correction.
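The hyper-parameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. The block below is illustrative only and is not the authors' implementation: the names CalibrationConfig, blend_embeddings, and top_k_indices are hypothetical, the defaults are midpoints of the quoted ranges, and the convex blend (with α weighting the object embedding) is an assumed form, since the excerpt does not reproduce Eq. (11), (15), (18), or (19).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationConfig:
    """Hypothetical container for the hyper-parameters quoted above."""
    alpha: float = 0.6            # fusion weight; quoted range 0.5-0.7
    lam: float = 0.7              # lambda in Eq. (19); quoted range 0.6-0.8
    lam_hat: float = 1.0          # extra factor in Eq. (15); quoted default 1
    token_threshold: float = 0.3  # token probability cut-off for Eq. (11)
    top_k: int = 5                # classes that receive the Eq. (18) operation


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def blend_embeddings(obj, bg, alpha):
    """Assumed convex blend of object and background embeddings.

    The quoted text only says the two embeddings are blended with a
    fusion weight alpha; putting alpha on the object side and
    re-normalizing afterwards are assumptions made for this sketch.
    """
    if len(obj) != len(bg):
        raise ValueError("embeddings must have the same dimension")
    mixed = [alpha * o + (1.0 - alpha) * b for o, b in zip(obj, bg)]
    return l2_normalize(mixed)


def top_k_indices(scores, k):
    """Indices of the k highest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

For example, blend_embeddings([1.0, 0.0], [0.0, 1.0], 0.5) gives a unit vector with equal components, and top_k_indices(scores, CalibrationConfig().top_k) selects the candidate classes to which the more expensive step would be restricted.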