Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Prediction-Powered Adaptive Shrinkage Estimation

Authors: Sida Li, Nikolaos Ignatiadis

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on both synthetic and real-world datasets show that PAS adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications. We conduct extensive experiments on both synthetic and real-world datasets.
Researcher Affiliation	Academia	1 Data Science Institute, The University of Chicago; 2 Department of Statistics, The University of Chicago. Correspondence to: Sida Li <EMAIL>.
Pseudocode	Yes	A pseudo-code implementation is also presented in Algorithm 1.
Open Source Code	Yes	The code for reproducing the experiments is available at https://github.com/listar2000/predictionpowered-adaptive-shrinkage.
Open Datasets	Yes	Experiments on both synthetic and real-world datasets show that PAS adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications. Fisch et al. (2024) have shown improvements in estimating the fraction of spiral galaxies using predictions on images from the Galaxy Zoo 2 dataset (Willett et al., 2013). Amazon Review Ratings (SNAP, 2014). The Amazon Fine Food Reviews dataset, provided by the Stanford Network Analysis Project (SNAP; SNAP (2014)) on Kaggle.
Dataset Splits	Yes	we randomly split the data points of each problem into labeled/unlabeled partitions (where we choose a 20/80 split ratio). For both datasets, we randomly split the data for each problem (a food product or galaxy subgroup) into a labeled and unlabeled partition with a 20/80 ratio.
Hardware Specification	Yes	All the experiments were conducted on a compute cluster with Intel Xeon Silver 4514Y (16 cores) CPU, Nvidia A100 (80GB) GPU, and 64GB of memory.
Software Dependencies	No	The paper mentions software such as 'Hugging Face's transformers library (Wolf, 2019)', 'bert-base-multilingual-uncased-sentiment model (Town, 2023)', 'ResNet50 architecture (He et al., 2016)', and 'Adam optimizer (Kingma & Ba, 2015)'. However, it does not give version numbers for these libraries or frameworks, which are required for a reproducible description of ancillary software.
Experiment Setup	Yes	We use a batch size of 256 and Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-3. After 20 epochs, the model achieves 87% training accuracy and 83% test accuracy.
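
The Dataset Splits row describes a random 20/80 labeled/unlabeled partition applied separately to each problem (a food product or galaxy subgroup). A minimal sketch of such a split follows. It is an illustration only, not the authors' code: the function names, the seeding scheme, and the rounding of the labeled count are all assumptions.

```python
import random


def split_labeled_unlabeled(points, labeled_fraction=0.2, seed=0):
    """Randomly partition one problem's data points into (labeled, unlabeled).

    Hypothetical helper: the rounding rule and the seed handling are
    assumptions, not details reported in the paper.
    """
    rng = random.Random(seed)          # local RNG so the split is repeatable
    shuffled = list(points)            # copy, leaving the caller's data intact
    rng.shuffle(shuffled)
    n_labeled = round(labeled_fraction * len(shuffled))
    return shuffled[:n_labeled], shuffled[n_labeled:]


def split_all_problems(problems, labeled_fraction=0.2, seed=0):
    """Split every problem independently, using a distinct seed per problem."""
    splits = {}
    for offset, name in enumerate(sorted(problems)):
        splits[name] = split_labeled_unlabeled(
            problems[name], labeled_fraction, seed + offset
        )
    return splits


if __name__ == "__main__":
    # Toy data standing in for per-problem observations.
    toy = {"product_a": list(range(10)), "product_b": list(range(100))}
    for name, (labeled, unlabeled) in split_all_problems(toy).items():
        print(name, len(labeled), len(unlabeled))
```

Fixing the seed makes the partition repeatable across runs, which matters here because the index entry does not report the seeds the authors used.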