Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Prediction-Powered Semi-Supervised Learning with Online Power Tuning

Authors: Noa Shoham, Ron Dorfman, Shalev Shaer, Kfir Y. Levy, Yaniv Romano

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Numerical experiments on both real and synthetic data demonstrate the advantage of our online SSL framework on regression and classification tasks. Our method outperforms both classic (biased) SSL and PPI-like training methods in scenarios where there is a subgroup in the data on which the teacher model performs poorly.
Researcher Affiliation	Academia	(1) Department of Electrical and Computer Engineering, Technion IIT; (2) Department of Computer Science, Technion IIT
Pseudocode	Yes	Algorithm 1: PP-SSL with Online Tuning
Open Source Code	Yes	Software for reproducing the experiments is available at https://github.com/noashoham/PP-SSL
Open Datasets	Yes	We evaluate our method on the California Housing dataset [34], which contains 8 numeric features and a target variable representing house prices, across 20,640 samples. We use the facial age estimation UTKFace dataset [35]. This dataset contains about 20,000 face images with age annotations ranging from 0 to 116 years. We evaluate our method on the CIFAR-10 dataset [37] and its corrupted variant, CIFAR-10C [38].
Dataset Splits	Yes	We set n = 20, N = 1,000, and ntest = 1,000. We split the data by setting n = 100, N = 18,000, nval = 1,000 for validation, and ntest = 1,000 for testing. We split the data by setting n = 700, N = 16,500, nval = 2,000, and ntest = 2,000. CIFAR-10 contains 60,000 images of size 32 × 32 from 10 classes, split into 50,000 training images and 10,000 test images.
Hardware Specification	Yes	The experiments were conducted on a system running Ubuntu 20.04.6 LTS, each experiment (single seed) with 2 CPU cores of Intel(R) Xeon(R) Gold CPUs at 2.40 GHz, 32 GB of RAM. All experiments were run on an Ubuntu 20.04.6 LTS system. To run a single-seed experiment, the hardware included 2 CPU cores from an Intel(R) Xeon(R) Gold processor at 2.40 GHz and 32 GB of RAM. All experiments were conducted on a high-performance computing cluster running Ubuntu 20.04.6 LTS. The hardware configuration includes 98 CPU cores (Intel(R) Xeon(R) Gold 2.40 GHz), 256 GB of RAM, and 8 NVIDIA A40 GPUs.
Software Dependencies	Yes	The software environment used Python 3.11.3 and PyTorch 2.5.1. The software environment used Python 3.11.3. The software stack consists of Python 3.11.3, PyTorch 2.5.1, and CUDA 12.4.
Experiment Setup	Yes	All methods (except the Teacher) are implemented by fitting a linear regression model using the Adam optimizer with the same hyper-parameters and batch size. Table S1: Experimental settings for synthetic regression and two-groups regression tasks. Optimizer: Adam; Batch size: 256; Learning rate: 0.001; Epochs: 3000. Table S2: California Housing Experiment Parameters. Optimizer: SGD; Batch size: Full dataset; Learning rate: 0.01; Epochs: 1000. Table S3: Training and experimental configuration for the UTKFace age estimation experiments. Loss function: L1; Optimizer: SGD; Momentum: 0.9; Weight decay: 0.001; Initial learning rate: 0.001; Batch size: 512; Epochs: 100. Table S4: California Housing Experiment Parameters. Optimizer: SGD; Batch size: 256; Learning rate: 0.001; Labeled Loss: CE; Unlabeled Loss: KL divergence.
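
The sizes quoted in the Dataset Splits row describe a four-way partition into labeled (n), unlabeled (N), validation, and test samples. The sketch below shows one way such a partition could be drawn. It assumes a uniform random split without replacement, which the quoted text does not confirm, and it assumes the n = 100, N = 18,000 split is applied to a pool of 20,640 samples (the California Housing size quoted above). The function name `split_indices` and the seed are hypothetical and do not come from the paper's repository.

```python
import random


def split_indices(total, n_labeled, n_unlabeled, n_val, n_test, seed=0):
    """Partition range(total) into disjoint labeled/unlabeled/val/test index lists.

    Hypothetical helper: the uniform shuffle and the seed are assumptions,
    not the procedure used by the paper's authors.
    """
    needed = n_labeled + n_unlabeled + n_val + n_test
    if needed > total:
        raise ValueError(f"requested {needed} samples but only {total} available")

    # Shuffle all indices once with a local, seeded generator.
    rng = random.Random(seed)
    order = list(range(total))
    rng.shuffle(order)

    # Cut the shuffled order into consecutive, non-overlapping slices.
    bounds = [
        0,
        n_labeled,
        n_labeled + n_unlabeled,
        n_labeled + n_unlabeled + n_val,
        needed,
    ]
    names = ["labeled", "unlabeled", "val", "test"]
    return {name: order[bounds[i]:bounds[i + 1]] for i, name in enumerate(names)}


# Sizes taken from the Dataset Splits row; the pool size is an assumption.
splits = split_indices(
    total=20_640, n_labeled=100, n_unlabeled=18_000, n_val=1_000, n_test=1_000
)
print({name: len(idx) for name, idx in splits.items()})
```

With these sizes, 20,100 of the 20,640 indices are used and the remaining 540 are left unassigned; asking for more samples than the pool holds raises a `ValueError` rather than silently reusing samples.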