Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Revisiting Semi-Supervised Learning in the Era of Foundation Models

Authors: Ping Zhang, Zheda Mai, Quang-Huy (Percy) Nguyen, Wei-Lun (Harry) Chao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct a systematic study on tasks where frozen VFMs underperform and reveal several key insights when fine-tuning them. ... Extensive experiments validate the effectiveness of this approach, offering actionable insights into SSL with VFMs and paving the way for more scalable and robust semi-supervised learning in the foundation model era.
Researcher Affiliation	Academia	Ping Zhang, Zheda Mai, Quang-Huy Nguyen, Wei-Lun Chao; The Ohio State University; EMAIL
Pseudocode	Yes	Algorithm 1: V-PET (Figure 2). Input: labeled dataset L and unlabeled dataset U; PEFT methods indexed by n ∈ [1, N]; VFMs indexed by m ∈ [1, M]; initialized parameters of VFM on PEFT θ_{n,m}. Output: optimal parameter θ*. // (a) Supervised Parameter-Efficient Fine-Tuning: for n ∈ [1, N], m ∈ [1, M] do: fine-tune θ_{n,m} on L to obtain θ*_{n,m}. // (b) Pseudo-Label Generation: for n ∈ [1, N], m ∈ [1, M] do: compute the one-hot pseudo-label set P_{n,m} = {one_hot(arg max_c f_{θ*_{n,m}}(u))}_{u ∈ U}. // (c) Pseudo-Label Ensemble: ensemble the pseudo-labels into P̄ := {p̄_i}_{i=1}^{|U|}, where p̄_i is computed from P_{n,m}[i] over n ∈ [1, N], m ∈ [1, M]. // (d) Self-Training: choose n' ∈ [1, N], m' ∈ [1, M] and fine-tune θ_{n',m'} on P̄ to get θ*.
Open Source Code	Yes	Our code is available at https://github.com/OSU-MLB/SSL-Foundation-Models.
Open Datasets	Yes	To this end, we introduce new SSL benchmark datasets based on the Visual Task Adaptation Benchmark (VTAB) [76], a diverse suite of classification tasks designed to evaluate visual representations. ... We select 6 datasets with a diverse range of applications for our experiments, including: DTD [16]: The Describable Texture Dataset (DTD)... SUN397 [69]: SUN397 is a large-scale scene recognition benchmark... Resisc45 [14]: Resisc45 is a remote sensing image dataset... Diabetic-Retinopathy [22]: This dataset focuses on high-resolution retinal images... Clevr-Count [39]: Derived from the CLEVR family of datasets... KITTI-Dist [27]: KITTI-Dist is based on the renowned KITTI benchmark...
Dataset Splits	Yes	To evaluate the robustness of SSL methods, we vary the number of labeled samples per class and adopt linear probing as the evaluation protocol, with shot counts chosen to keep each task sufficiently challenging for frozen VFMs. ... A summary of the datasets appears in Table 2, and detailed descriptions are provided in Appendix A. ... Table 2: Summary of our benchmark. ... Dataset ... |L|/Class ... DTD ... 3, 6 ... SUN397 ... 3, 6 ... RESISC45 ... 1, 2 ... Retinopathy ... 4, 8 ... CLEVR-C ... 1, 2 ... KITTI ... 5, 10 ... The detailed statistics of the datasets, specifically the validation and test set sizes, are presented in Table 5. Table 5: Dataset statistics (columns: Dataset; N v; Test Size). DTD: 752 / 1,880; SUN397: 17,401 / 21,750; Resisc45: 5,040 / 6,300; Retinopathy: 9,207 / 42,670; Clevr-C: 14,000 / 15,000; KITTI: 1,354 / 711.
Hardware Specification	Yes	Our experiments were conducted on a workstation equipped with eight NVIDIA RTX 6000 Ada GPUs, two AMD EPYC 9554 64-Core Processors, and 800GB of RAM. Additionally, we utilized NVIDIA Tesla V100, NVIDIA Tesla A100, and NVIDIA RTX H100 GPUs for certain experiments.
Software Dependencies	No	All experiments were implemented using PyTorch [54].
Experiment Setup	Yes	For all the experiments, we employ AdamW [44] optimizer with a cosine annealing learning rate scheduler [43] with warm-up period with 2.5% of total iterations. We use a batch size of 32 + 32 for all experiments, where the first 32 corresponds to labeled data and the second 32 corresponds to unlabeled data. We list the hyperparameter search spaces for the SSL, Labeled-Only, and ST settings in Table 6, including the drop path rate (dpr) [34], training augmentation (train-aug), LoRA dimension (lora_dim), adapter bottleneck size (adapter_bottleneck), weight decay, momentum, learning rate (lr), and number of epochs. For other unmentioned SSL algorithm related hyperparameters, we use the default values provided in the original papers. ... Table 6: Hyperparameter search spaces for SSL, Labeled-Only, and ST settings.
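
Steps (b) and (c) of the algorithm quoted in the Pseudocode row can be illustrated with a small standard-library sketch. It is not the authors' implementation: the function names are hypothetical, the class scores are made-up toy values, and averaging the one-hot labels across models is an assumption, since the quoted pseudocode does not make the exact ensembling rule recoverable.

```python
def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at position `index`."""
    return [1.0 if c == index else 0.0 for c in range(num_classes)]


def pseudo_labels(scores):
    """Step (b): turn one model's per-sample class scores into one-hot labels."""
    return [
        one_hot(max(range(len(s)), key=s.__getitem__), len(s))
        for s in scores
    ]


def ensemble(label_sets):
    """Step (c): combine one-hot labels from several models per sample.

    Averaging is an assumption made for this sketch; the result is a soft
    label (a vote share per class) for every unlabeled sample.
    """
    num_models = len(label_sets)
    return [
        [sum(votes) / num_models for votes in zip(*per_sample)]
        for per_sample in zip(*label_sets)
    ]


# Toy scores from three hypothetical (PEFT method, VFM) pairs
# on two unlabeled samples with three classes.
model_a = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
model_b = [[0.2, 0.6, 0.2], [0.1, 0.8, 0.1]]
model_c = [[0.6, 0.3, 0.1], [0.4, 0.35, 0.25]]

soft_labels = ensemble([pseudo_labels(m) for m in (model_a, model_b, model_c)])
```

Each entry of `soft_labels` sums to 1 and records how many of the fine-tuned models voted for each class, which is the target that step (d) would then train on.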
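
The learning-rate schedule described in the Experiment Setup row (cosine annealing with a warm-up over 2.5% of total iterations) can be sketched as a plain function. Only the 2.5% warm-up fraction comes from the quoted text; the linear warm-up shape, the decay to zero, and the name `lr_at` are assumptions made for illustration.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_frac=0.025):
    """Learning rate at a 0-indexed `step` out of `total_steps`.

    Assumed shape: linear warm-up up to `base_lr`, then cosine annealing
    toward zero over the remaining iterations.
    """
    warmup_steps = max(1, int(round(total_steps * warmup_frac)))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Example: 1,000 iterations gives a 25-step warm-up.
schedule = [lr_at(s, total_steps=1000, base_lr=1e-3) for s in range(1000)]
```

In a PyTorch training loop the same shape is usually obtained by chaining a warm-up scheduler with a cosine scheduler; the standalone function keeps the arithmetic visible.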