Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Do not trust what you trust: Miscalibration in Semi-supervised Learning
Authors: Shambhavi Mishra, Balamurali Murugesan, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on a variety of SSL image classification benchmarks demonstrate that the proposed solution systematically improves the calibration performance of relevant SSL models, while also enhancing their discriminative power, being an appealing addition to tackle SSL tasks. Code: https://github.com/ShambhaviCodes/miscalibration-ssl. 5 Experiments Datasets. We resort to the recent Unified Semi-supervised Learning Benchmark for Classification (USB) (Wang et al., 2022), which compiles a diverse and challenging benchmark across several datasets. |
| Researcher Affiliation | Academia | ¹LIVIA, ÉTS Montréal, Canada; ²International Laboratory on Learning Systems (ILLS), McGill / ÉTS / MILA / CNRS / Université Paris-Saclay / CentraleSupélec, Canada |
| Pseudocode | No | The paper describes mathematical formulations and algorithmic steps in prose and equations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | Yes | Code: https://github.com/ShambhaviCodes/miscalibration-ssl |
| Open Datasets | Yes | We resort to the recent Unified Semi-supervised Learning Benchmark for Classification (USB) (Wang et al., 2022), which compiles a diverse and challenging benchmark across several datasets. In particular, we focus on three popular datasets: CIFAR-100 (Krizhevsky & Hinton, 2010), which has significant value as a standard for fine-grained image classification due to its wide range of classes and detailed object distinctions; STL-10 (Coates et al., 2011), which is widely recognized for its limited sample size and extensive collection of unlabeled data, rendering it a challenging scenario of special significance in the context of SSL; and EuroSAT (Helber et al., 2019) |
| Dataset Splits | Yes | For CIFAR-100, a renowned benchmark for fine-grained image classification, we considered two label settings: 2 labeled samples and 4 labeled samples per class for each of the 100 classes, resulting in a total of 50,000 training samples and 10,000 samples for testing. Each image in CIFAR-100 is sized at 32×32 pixels. STL-10, known for its limited sample size and extensive unlabeled data, offers a unique challenge. We employed two label settings as well: 4 labeled samples and 10 labeled samples per class for all 10 classes, and an additional 100,000 unlabeled samples for training, along with 8,000 samples for testing. Lastly, EuroSAT, based on Sentinel-2 satellite images, features two label settings: 2 labeled samples per class and 4 samples per class for 10 classes. With a total of 16,200 training samples, including labeled and unlabeled images, and 5,400 testing samples, EuroSAT images are sized at 64×64. |
| Hardware Specification | Yes | Training FreeMatch on CIFAR-100 with 400 labeled samples goes from 12 days (WideResNet from scratch) to 10 hours (ViT-Small) on an NVIDIA V100-32G GPU. |
| Software Dependencies | No | The paper mentions software components like 'RandAugment' and 'cosine annealing scheduler', but it does not specify any version numbers for general software, libraries, or programming languages (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Regarding algorithm-independent hyperparameters, we adhered to the settings outlined in (Wang et al., 2022). Specifically, the learning rate was set to 5×10⁻⁴ for CIFAR-100, 10⁻⁴ for STL-10, and 5×10⁻⁵ for EuroSAT. During training, the batch size was fixed at 8, while for evaluation, it was set to 16. Additionally, the layer decay rate varied across datasets: 0.5 for CIFAR-100, 0.95 for STL-10, and 1.0 for EuroSAT. Weak augmentation techniques employed included random crop and random horizontal flip, while strong augmentation utilized RandAugment (Cubuk et al., 2020). The cosine annealing scheduler was utilized with a total of 204,800 steps and a warm-up period of 5,120 steps. Both labeled and unlabeled batch sizes were set to 16. |
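To make the reported experiment setup easier to scan, the hyperparameters quoted above can be collected into a single configuration table. This is an illustrative sketch, not code from the authors' repository; the dictionary names and the `config_for` helper are assumptions made here for readability.

```python
# Illustrative sketch (not the authors' code): the algorithm-independent
# hyperparameters quoted in the "Experiment Setup" row, gathered into one
# lookup table. All values come from the report above; names are assumed.

EXPERIMENT_CONFIG = {
    "CIFAR-100": {"learning_rate": 5e-4, "layer_decay": 0.5},
    "STL-10":    {"learning_rate": 1e-4, "layer_decay": 0.95},
    "EuroSAT":   {"learning_rate": 5e-5, "layer_decay": 1.0},
}

SHARED = {
    "train_batch_size": 8,          # batch size during training
    "eval_batch_size": 16,          # batch size during evaluation
    "labeled_batch_size": 16,       # labeled batch size
    "unlabeled_batch_size": 16,     # unlabeled batch size
    "scheduler": "cosine_annealing",
    "total_steps": 204_800,
    "warmup_steps": 5_120,
    "weak_augmentations": ["random_crop", "random_horizontal_flip"],
    "strong_augmentation": "RandAugment",
}

def config_for(dataset: str) -> dict:
    """Merge shared settings with dataset-specific ones (dataset wins)."""
    return {**SHARED, **EXPERIMENT_CONFIG[dataset]}
```

For example, `config_for("STL-10")` yields the shared schedule and batch sizes plus the STL-10-specific learning rate of 10⁻⁴ and layer decay of 0.95.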