Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Semi-Supervised Regression with Heteroscedastic Pseudo-Labels

Authors: Xueqing Sun, Renzhen Wang, Quanziang Wang, Yichen Wu, Xixi Jia, Deyu Meng

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide theoretical insights and extensive experiments to validate our approach across various benchmark SSR datasets, and the results demonstrate superior robustness and performance compared to existing methods. Our code is available at https://github.com/sxq/Heteroscedastic Pseudo-Labels.
Researcher Affiliation	Academia	1 Xi'an Jiaotong University 2 City University of Hong Kong 3 Harvard University 4 Xidian University 5 Pazhou Laboratory (Huangpu)
Pseudocode	Yes	Algorithm 1 Mini-batch Training Algorithm of the Method
Open Source Code	Yes	Our code is available at https://github.com/sxq/Heteroscedastic Pseudo-Labels.
Open Datasets	Yes	We evaluate the effectiveness of our algorithm on three benchmark datasets: UTKFace [58], an image-based age estimation dataset; IMDB-WIKI [39], a large-scale dataset for age estimation; and STS-B [5, 45], a benchmark for assessing semantic similarity between sentence pairs. Please refer to Appendix B.1 for more detailed information about these datasets.
Dataset Splits	Yes	The resulting modified dataset includes 10,518 training samples, 3,287 testing samples, and 2,629 validation samples. ... IMDB-WIKI [39]. ... consisting of 191.5K images for training and 11.0K images for validation and testing. ... STS-B [5, 45]. ... construct the training set containing 5.2K pairs, and both the balanced validation set and the test set containing 1K pairs each.
Hardware Specification	Yes	We evaluate the computational cost on a single NVIDIA GeForce RTX 4090 for fair comparison.
Software Dependencies	No	The model is trained by Adam optimizer [24] for 30 epochs, with a learning rate of 10^-4 for the feature extractor and 10^-3 for the regression head. Additionally, we conduct random cropping and horizontal flipping as weak augmentation, and RandAugment [8] as strong augmentation in the data augmentation process of SSL. As for gϕ, it is optimized by Adam optimizer with a learning rate of 10^-4. ... For the STS-B dataset, we adopt a BiLSTM-based regression model with GloVe word embeddings, in line with [45, 53].
Experiment Setup	Yes	The model is trained by Adam optimizer [24] for 30 epochs, with a learning rate of 10^-4 for the feature extractor and 10^-3 for the regression head. Additionally, we conduct random cropping and horizontal flipping as weak augmentation, and RandAugment [8] as strong augmentation in the data augmentation process of SSL. As for gϕ, it is optimized by Adam optimizer with a learning rate of 10^-4. ... For the STS-B dataset, we adopt a BiLSTM-based regression model with GloVe word embeddings, in line with [45, 53], which is trained by Adam optimizer for 200 epochs, with a learning rate of 10^-4 for the feature extractor and 10^-3 for the regression head.
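The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes the split between the feature-extractor and regression-head learning rates explicit. This is a minimal sketch, not the authors' code: the `TrainConfig` class, the `param_group_lrs` helper, and the module names `feature_extractor`, `regression_head`, and `g_phi` are hypothetical. It also assumes that the 30-epoch setting applies to the image datasets (the excerpt elides which datasets it covers, but the cropping and flipping augmentations suggest images) and leaves the learning rate for gϕ unset for STS-B, where the excerpt does not state it.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings quoted in the excerpt."""
    model: str
    optimizer: str
    epochs: int
    lr_feature_extractor: float
    lr_regression_head: float
    # Learning rate for g_phi, which the excerpt says has its own Adam
    # optimizer; None where the excerpt does not report a value.
    lr_g_phi: Optional[float] = None
    weak_augmentation: Tuple[str, ...] = ()
    strong_augmentation: Tuple[str, ...] = ()


# Assumed image-dataset setting (the excerpt elides the dataset names here).
IMAGE_CONFIG = TrainConfig(
    model="not stated in excerpt",
    optimizer="Adam",
    epochs=30,
    lr_feature_extractor=1e-4,
    lr_regression_head=1e-3,
    lr_g_phi=1e-4,
    weak_augmentation=("random_crop", "horizontal_flip"),
    strong_augmentation=("RandAugment",),
)

# STS-B setting: BiLSTM regressor with GloVe embeddings, trained for 200 epochs.
STSB_CONFIG = TrainConfig(
    model="BiLSTM + GloVe",
    optimizer="Adam",
    epochs=200,
    lr_feature_extractor=1e-4,
    lr_regression_head=1e-3,
)


def param_group_lrs(cfg: TrainConfig) -> Dict[str, float]:
    """Return the learning rate for each (hypothetically named) module.

    g_phi is included only when a rate is reported; per the excerpt it is
    trained by a separate Adam optimizer, not in the same parameter groups.
    """
    lrs = {
        "feature_extractor": cfg.lr_feature_extractor,
        "regression_head": cfg.lr_regression_head,
    }
    if cfg.lr_g_phi is not None:
        lrs["g_phi"] = cfg.lr_g_phi
    return lrs


if __name__ == "__main__":
    for name, cfg in (("image", IMAGE_CONFIG), ("STS-B", STSB_CONFIG)):
        print(name, cfg.epochs, param_group_lrs(cfg))
```

In PyTorch, the two backbone rates would typically map onto per-parameter groups of a single optimizer, for example `torch.optim.Adam([{"params": feature_extractor.parameters(), "lr": 1e-4}, {"params": regression_head.parameters(), "lr": 1e-3}])`, with gϕ given its own `torch.optim.Adam` instance. The module names are illustrative only.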