Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Unveiling Extraneous Sampling Bias with Data Missing-Not-At-Random

Authors: Chunyuan Zheng, Haocheng Yang, Haoxuan Li, Mengyue Yang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on three real-world datasets, including a large-scale industrial dataset, to show the effectiveness of our method. The code is available at https://github.com/ChunyuanZheng/neurips-25-SuperDR.
Researcher Affiliation	Academia	(1) Peking University; (2) National University of Singapore; (3) University of Bristol
Pseudocode	Yes	Algorithm 1: The Proposed Doubly Robust Joint Learning Algorithm under Super-population
Open Source Code	Yes	The code is available at https://github.com/ChunyuanZheng/neurips-25-SuperDR.
Open Datasets	Yes	To verify the effectiveness of the proposed method in the real-world dataset, the dataset that contains both biased and unbiased data is required. Following the previous studies [14, 15, 32, 66], the following three widely used real-world datasets are adopted to conduct our experiments: Coat contains ratings from 290 users to 300 items with 6,960 biased ratings and 4,640 unbiased ratings. Yahoo! R3 contains ratings from 15,400 users to 1,000 items with 311,704 biased ratings and 54,000 unbiased ratings. We binarize the ratings to 0 for ratings less than three, otherwise to 1. We further use a fully exposed industrial dataset KuaiRec [67] with 4,676,570 video watching ratio records from 1,411 users to 3,327 videos. Following previous studies [59, 60], we biasedly select 201,171 samples according to the watch ratio as the training set and randomly select 117,113 samples as the unbiased test set.
Dataset Splits	Yes	To simulate the super-population scenario, we first randomly sample b% users and items (unless otherwise stated, b is set to 50% in our experiments) from the training set and then use the whole unbiased test set to evaluate the debiasing performance. ... Following previous studies [59, 60], we biasedly select 201,171 samples according to the watch ratio as the training set and randomly select 117,113 samples as the unbiased test set.
Hardware Specification	Yes	All the experiments are implemented on PyTorch with the GeForce RTX 3090 as the computational resource.
Software Dependencies	No	All the experiments are implemented on PyTorch with the GeForce RTX 3090 as the computational resource. Adam is utilized as the optimizer in all experiments.
Experiment Setup	Yes	The following three metrics are used to measure the debiasing performance: AUC, NDCG@K, and Recall@K, where we set K = 5 for Coat and Yahoo! R3, while set K = 50 for KuaiRec. All the experiments are implemented on PyTorch with the GeForce RTX 3090 as the computational resource. Adam is utilized as the optimizer in all experiments. To simulate the super-population scenario, we first randomly sample b% users and items (unless otherwise stated, b is set to 50% in our experiments) from the training set and then use the whole unbiased test set to evaluate the debiasing performance. ... In addition, the dimension of user and item embedding are fixed as 32. We tune learning rate in {0.001, 0.005, 0.01, 0.02, 0.05} for parameters in prediction, imputation, and propensity model, and in {0.01, 0.05, 0.1, 0.15, 0.2} for ϵ, batch size in {128, 256, 512} for Coat and {1024, 2048, 4096} for Yahoo! R3 and KuaiRec. The weight decay is tuned in {1e-6, 5e-6, ..., 5e-3, 1e-2}. In addition, we use the logistic regression model as the propensity model... we tune the propensity clip threshold in [0.005, 0.05]. For simplicity, we fix the step in inner loop for updating prediction and imputation models in Algorithm 1 as 1.
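
The preprocessing steps quoted in the table (binarizing ratings at three, sampling b% of users and items, and clipping propensities) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, keeping only interactions whose user and item were both sampled is an assumed reading of "randomly sample b% users and items", and treating the clip threshold as a lower bound on the propensity is likewise an assumption.

```python
import numpy as np


def binarize_ratings(ratings, threshold=3):
    """Map ratings below `threshold` to 0 and all other ratings to 1."""
    return (np.asarray(ratings) >= threshold).astype(np.int64)


def subsample_users_items(user_ids, item_ids, b=0.5, seed=0):
    """Return a boolean mask over interactions after sampling a fraction b
    of the distinct users and of the distinct items.

    Assumption: an interaction is kept only if BOTH its user and its item
    were sampled. The quoted text does not spell this out.
    """
    rng = np.random.default_rng(seed)
    user_ids = np.asarray(user_ids)
    item_ids = np.asarray(item_ids)
    users = np.unique(user_ids)
    items = np.unique(item_ids)
    n_users = max(1, int(b * len(users)))
    n_items = max(1, int(b * len(items)))
    kept_users = rng.choice(users, size=n_users, replace=False)
    kept_items = rng.choice(items, size=n_items, replace=False)
    return np.isin(user_ids, kept_users) & np.isin(item_ids, kept_items)


def clip_propensity(propensity, threshold=0.05):
    """Clip estimated propensities from below (assumed clipping direction)."""
    return np.clip(np.asarray(propensity, dtype=float), threshold, 1.0)


if __name__ == "__main__":
    # Toy data: 4 users x 4 items, fully observed.
    users = np.repeat(np.arange(4), 4)
    items = np.tile(np.arange(4), 4)
    mask = subsample_users_items(users, items, b=0.5, seed=0)
    print("kept interactions:", int(mask.sum()))
    print("binarized:", binarize_ratings([1, 2, 3, 4, 5]).tolist())
    print("clipped:", clip_propensity([0.001, 0.5], threshold=0.05).tolist())
```

With b = 0.5 on the 4 x 4 toy grid, two users and two items are sampled, so four interactions remain. The default threshold of 0.05 is simply the upper end of the tuning range quoted above.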