Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Implicit Modeling for Transferability Estimation of Vision Foundation Models

Authors: Yaoyan Zheng, Huiqun Wang, Nan Zhou, Di Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on a comprehensive benchmark spanning extensive training regimes and a wider variety of model types demonstrate that ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency.
Researcher Affiliation	Academia	Yaoyan Zheng, Huiqun Wang, Nan Zhou, Di Huang; (1) State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing, China; (2) School of Computer Science and Engineering, Beihang University, Beijing, China; EMAIL
Pseudocode	Yes	A.2 Pseudo-code. Algorithm 1: Training process of DVA without deparametric approximation. Algorithm 2: Training process of DVA.
Open Source Code	Yes	Code is available at https://github.com/BUAAHugeGun/ITM.
Open Datasets	Yes	We use 10 widely adopted single-label image classification datasets for transfer learning, primarily sourced through the official PyTorch [40] (e.g., torchvision.datasets): CIFAR-10, CIFAR-100 [41], FGVC Aircraft [42], Caltech-101 [43], DTD [44], Oxford-IIIT Pets [45], Stanford Cars [46], SUN-397 [47], Food-101 [48], and Oxford 102 Flowers [49].
Dataset Splits	Yes	During training, 4/5 of the official training split from each benchmark is randomly sampled for optimizing the ITM framework, while the remaining 1/5 is reserved for score calculation.
Hardware Specification	Yes	All experiments use a single NVIDIA V100 GPU (32 GB), with batch sizes of 64 for classification and 8 for segmentation. [...] All comparisons are conducted under a consistent environment with 8-core CPUs to ensure fairness.
Software Dependencies	No	All datasets are sourced from the official torchvision.datasets module of PyTorch [40]. [...] we follow official implementations [10, 51, 26, 11, 12] and adopt AdamW [52] to jointly fine-tune backbones and classification heads for 100 epochs.
Experiment Setup	Yes	Fine-tuning protocols. Fine-tuned performance is critical for accurate TE ranking. However, settings used in prior work [17, 4, 7, 6] often underexploit modern models, introducing evaluation bias. To address this, we follow official implementations [10, 51, 26, 11, 12] and adopt AdamW [52] to jointly fine-tune backbones and classification heads for 100 epochs. Learning rates are grid-searched over {10⁻⁵, 2×10⁻⁵, 5×10⁻⁵} and weight decays over {10⁻², 10⁻⁴}. Evaluation is performed every epoch, and the best checkpoint is used for ground-truth ranking. All experiments use a single NVIDIA V100 GPU (32 GB), with batch sizes of 64 for classification and 8 for segmentation. [...] To balance accuracy and efficiency, we set the ITM training iterations to 500. The transferability score s is evaluated every 100 iterations, with the highest score recorded as the final estimation for each model. During training, 4/5 of the official training split from each benchmark is randomly sampled for optimizing the ITM framework, while the remaining 1/5 is reserved for score calculation. The step size η in DVA is fixed at 0.01, and the iteration count n is determined adaptively (details provided in the appendix). Across all experiments, the learning rate α is set to 5×10⁻³, and AdamW [52] is adopted as the optimizer. All comparisons are conducted under a consistent environment with 8-core CPUs to ensure fairness.
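
The split and search settings quoted in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustration of the quoted numbers, not the authors' code: the function names, the fixed seed, and the assumption that the score is evaluated at iterations 100, 200, ..., 500 are all hypothetical and do not come from the paper or its repository.

```python
import itertools
import random

# Grid values quoted in the Experiment Setup row.
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
WEIGHT_DECAYS = [1e-2, 1e-4]


def split_train_indices(num_samples, seed=0):
    """Randomly split sample indices 4/5 vs 1/5.

    The first part stands in for the data used to optimize ITM, the second
    for the data reserved for score calculation. The seed is an assumption;
    the source does not state one.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    cut = (num_samples * 4) // 5  # integer arithmetic avoids float rounding
    return indices[:cut], indices[cut:]


def fine_tuning_grid():
    """Enumerate every (learning rate, weight decay) pair in the grid."""
    return list(itertools.product(LEARNING_RATES, WEIGHT_DECAYS))


def score_evaluation_steps(total_iterations=500, eval_every=100):
    """Iterations at which the transferability score would be evaluated.

    Assumes evaluation happens at the end of each 100-iteration block.
    """
    return list(range(eval_every, total_iterations + 1, eval_every))


if __name__ == "__main__":
    optimize_idx, score_idx = split_train_indices(50_000)
    print(len(optimize_idx), len(score_idx))
    print(len(fine_tuning_grid()), "fine-tuning configurations")
    print(score_evaluation_steps())
```

For a 50,000-image training split such as CIFAR-10's, this yields 40,000 samples for optimization and 10,000 for scoring, and the grid contains six fine-tuning configurations per model and dataset.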