Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

CF-VLM: CounterFactual Vision-Language Fine-tuning

Authors: Jusheng Zhang, Kaitong Cai, Yijia Fan, Jian Wang, Keze Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiment We systematically evaluate our CF-VLM for vision-language models, i.e., Qwen-VL (7B) [44] and CLIP-ViT-B/32, with a focus on enhancing complex visual reasoning. The experimental setup covers pretraining and fine-tuning on filtered CC12M [45] (8.6M image-text pairs) and the CC3M [46] subset (2.6M pairs), with MSCOCO Captions [47] (120K pairs) optionally included for analysis. For each image-text pair, one counterfactual sample is generated by our dynamic counterfactual generation (DCF) strategy, effectively doubling the training data. Evaluation is conducted on compositional and generalization benchmarks, including ConMe [48], ARO [49], VL-Checklist [50], ImageNet-1k [51] (zero-shot classification), and MSCOCO/Flickr30k [52] (zero-shot retrieval).
Researcher Affiliation	Collaboration	Jusheng Zhang1, Kaitong Cai1, Yijia Fan1, Jian Wang2, Keze Wang1, 1Sun Yat-sen University 2Snap Inc. Corresponding author: EMAIL
Pseudocode	No	The paper describes the methodology and loss functions in detail using natural language and mathematical equations, but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code	No	As stated in Appendix E, the authors plan to release the dataset upon official publication pending copyright clearance; code availability is not explicitly linked in the current draft. The paper states "Our CF-VLM provides a robust foundation for deploying VLMs in high-stakes, real-world scenarios requiring reliable reasoning and interpretability code.", which is a general statement, not an explicit release statement.
Open Datasets	Yes	Our training and evaluation are conducted on a series of standardized image-text datasets and structured evaluation benchmarks, as detailed below: Training Data: We utilize the cleaned version of CC12M (8.6M image-text pairs) and its subset CC3M (2.6M pairs) as our primary training corpora. All samples undergo standardized preprocessing, including image resizing to 224 × 224, text normalization, and language alignment to ensure stable input distribution. MSCOCO Captions (120K pairs): This dataset is not used for training, but is optionally included for auxiliary analysis tasks (e.g., image-text retrieval or structural generalization), to assess the transferability of our method beyond the core training corpus. Evaluation Benchmarks: We evaluate the model's compositional reasoning and cross-modal generalization on the following benchmarks: ConMe Suite, ARO (Attribute Relation Object), VL-Checklist, ImageNet-1k (Zero-shot Classification), MSCOCO / Flickr30k (Image-Text Retrieval).
Dataset Splits	Yes	For pretraining/fine-tuning on CC12M/CC3M, we used a 99% training and 1% validation split, randomly sampled. For evaluation on downstream tasks, we adhered to their standard publicly available splits. For ImageNet-1k, standard validation set images were used for zero-shot classification. For MSCOCO and Flickr30k, we utilized the Karpathy splits for zero-shot retrieval. During training, each batch maintains a fixed 1:4 ratio between factual and counterfactual samples, to enhance the model's sensitivity to semantically critical perturbations.
Hardware Specification	Yes	Models are trained using AdamW (β1 = 0.9, β2 = 0.98, ϵ = 1 × 10⁻⁶), a peak learning rate of 1 × 10⁻⁵ (cosine decay, 500 warmup steps), weight decay 0.1, and batch size 256 on a single NVIDIA A100 (bf16). Hardware: Experiments were primarily run on NVIDIA A100 GPUs. Depending on model size and batch configuration, between 1 to 8 GPUs were utilized per experimental run.
Software Dependencies	Yes	Software: Key software libraries included Python 3.9+, PyTorch 1.13.1 (with CUDA 11.7), Transformers 4.28.1, and a customized fork of OpenCLIP for certain baseline implementations. Standard scientific computing libraries such as NumPy 1.23.5 and Pandas 1.5.3 were used for data handling and analysis. Operating System: All nodes ran Ubuntu 20.04 LTS.
Experiment Setup	Yes	This section provides a comprehensive overview of the experimental setup, including hardware and software configurations, dataset preparation specifics, and detailed hyperparameter settings used for training and evaluating our proposed CF-VLM framework and all baseline models. Table 5: Core hyperparameter settings for CF-VLM fine-tuning. Values were kept consistent across different base models (Qwen-VL, LLaVA-1.5) unless specified.
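
Illustrative sketch (not from the paper): the Hardware Specification and Dataset Splits rows quote a peak learning rate of 1 × 10⁻⁵ with cosine decay and 500 warmup steps, and a randomly sampled 99%/1% train/validation split. The Python below shows one way such a schedule and split could be written using only the standard library. The total step count, the linear shape of the warmup, the decay floor of zero, and the random seed are assumptions, since none of them is stated in the quoted text; the function names are hypothetical.

```python
import math
import random

PEAK_LR = 1e-5        # peak learning rate quoted in the table
WARMUP_STEPS = 500    # warmup steps quoted in the table
TOTAL_STEPS = 10_000  # ASSUMPTION: total training steps are not given here
MIN_LR = 0.0          # ASSUMPTION: decay floor is not given here


def lr_at(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
          total_steps=TOTAL_STEPS, min_lr=MIN_LR):
    """Learning rate at a 0-indexed step: warmup, then cosine decay."""
    if step < warmup_steps:
        # ASSUMPTION: linear warmup; the quoted text only says "500 warmup steps".
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def split_indices(n, val_fraction=0.01, seed=0):
    """Randomly split range(n) into (train, val) index lists.

    val_fraction=0.01 mirrors the quoted 99%/1% split; the seed is an
    ASSUMPTION added only to make this sketch deterministic.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_val = max(1, round(n * val_fraction))
    return indices[n_val:], indices[:n_val]


if __name__ == "__main__":
    for step in (0, 499, 500, 5_000, 10_000):
        print(step, lr_at(step))
    train, val = split_indices(1_000)
    print(len(train), len(val))
```

The schedule is kept as a plain function of the step so it can be checked without a deep learning framework; in a PyTorch setup it could be wrapped in a scheduler, but that wiring is left out because the page does not describe it.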