Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

PUATE: Efficient ATE Estimation from Treated (Positive) and Unlabeled Units

Authors: Masahiro Kato, Fumiaki Kozai, Ryo Inokuchi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This section investigates the empirical performance of the proposed estimators. We also show the experimental results using semi-synthetic data in Appendix M. We generate synthetic data under the censoring setting, where the covariates $X$ are drawn from a multivariate normal distribution as $X \sim \zeta_0(x)$, where $\zeta_0(x)$ is the density of $N(0, I_p)$, and $I_p$ denotes the $(p \times p)$ identity matrix. We set $p = 3$. Set $P(D \mid X) = \mathrm{trunc}(\mathrm{sigmoid}(X^\top \beta), 0.1, 0.9)$, where $\beta$ is a coefficient sampled from $N(0, 0.5 I_p)$, and $\mathrm{trunc}(t, a, b)$ truncates $t$ by $a$ and $b$ ($a < b$). Treatment $D$ is sampled from the probability. The observation indicator $O$ is generated from a Bernoulli distribution with probability $c$ if $D_i = 1$ and $O_i = 0$ if $D_i = 0$. Here, $c$ is generated from a uniform distribution with support $[0, 1]$. The outcome is generated as $Y = X^\top \beta + 1.1 + \tau_0 D + \varepsilon$, where $\varepsilon \sim N(0, 1)$, where we set $\tau_0 = 3$. We set $n = 3000$. We conduct 5000 trials and report the empirical mean squared errors (MSEs) and biases for the true ATE and the coverage ratio (Cov. ratio) computed from the confidence intervals in Table 1. We also present the empirical distributions of the ATE estimates in Figure 2.
Researcher Affiliation	Industry	Masahiro Kato, Fumiaki Kozai, Ryo Inokuchi; Mizuho-DL Financial Technology Co., Ltd., Chiyoda-ku, Tokyo 102-0083; EMAIL
Pseudocode	Yes	Algorithm 1 Cross-fitting in the censoring setting... Algorithm 2 Cross-fitting in the case-control setting
Open Source Code	No	Justification: We will organize and provide the experimental code until the camera-ready.
Open Datasets	Yes	In this section, we investigate the empirical performance of our estimators using the Infant Health and Development Program (IHDP) dataset. The dataset contains simulated outcomes paired with covariates observed in the real world (Hill, 2011).
Dataset Splits	Yes	Algorithm 1 Cross-fitting in the censoring setting. Input: Observations $\mathcal{D} := \{(X_i, O_i, Y_i)\}_{i=1}^{n}$, number of folds $L$, and estimation methods for $\mu_{T,0}$, $\nu_0$, $\pi_0$. Let $\mathcal{I} = \{1, 2, \dots, n\}$ be the index set. Randomly split $\mathcal{I}$ into $L$ roughly equal-sized folds, $(\mathcal{I}^{(\ell)})_{\ell \in \mathcal{L}}$. Note that $\bigcup_{\ell \in \mathcal{L}} \mathcal{I}^{(\ell)} = \mathcal{I}$. for $\ell \in \mathcal{L}$ do: Set the training data as $\mathcal{I}^{(-\ell)} = \{1, 2, \dots, n\} \setminus \mathcal{I}^{(\ell)}$. Construct estimators of nuisance parameters on $\mathcal{I}^{(-\ell)}$, denoted by $\hat{\mu}^{(\ell)}_{T,n}$, $\hat{\nu}^{(\ell)}_{n}$, $\hat{\pi}^{(\ell)}_{n}$. end for. Output: Obtain an ATE estimate $\hat{\tau}^{\text{cens-eff}}_{n}$ using $\hat{\mu}^{(\ell)}_{T,n}$, $\hat{\nu}^{(\ell)}_{n}$, and $\hat{\pi}^{(\ell)}_{n}$. We set $n = 3000$. We conduct 5000 trials and report the empirical mean squared errors (MSEs) and biases for the true ATE and the coverage ratio (Cov. ratio) computed from the confidence intervals in Table 1. We set $m = 1000$ and $l = 2000$ and compute the same evaluation metrics as in the censoring setting.
Hardware Specification	Yes	All experiments were conducted on a Mac computer equipped with an Apple M2 processor and 24 GB of RAM.
Software Dependencies	No	The paper mentions using "linear regression and (linear) logistic regression" and "three-layer perceptrons with hidden layers of 100 nodes" for estimation, as well as specific PU learning methods from other papers (Elkan & Noto (2008), Kiryo et al. (2017)). However, it does not provide specific version numbers for any software libraries (e.g., Python, PyTorch, TensorFlow, scikit-learn, etc.) that would be necessary to replicate the computational environment.
Experiment Setup	No	The nuisance parameters are estimated using linear regression and (linear) logistic regression. We compared our proposed estimator, $\hat{\tau}^{\text{cens-eff}}_{n}$, with the other candidates, the IPW estimator $\hat{\tau}^{\text{cens-IPW}}_{n}$ and the DM estimator $\hat{\tau}^{\text{cens-DM}}_{n}$, defined in Remarks 4.4 and 4.4, respectively. We set $n = 3000$. We conduct 5000 trials and report the empirical mean squared errors (MSEs) and biases for the true ATE and the coverage ratio (Cov. ratio) computed from the confidence intervals in Table 1. In the case-control setting, covariates for the treatment and unknown groups are generated from different $p$-dimensional normal distributions: $X_T \sim \zeta_{T,0}(x)$ and $X \sim \zeta_0(x) = e_0(1)\zeta_{T,0}(x) + e_0(0)\zeta_C(x)$, where we set $p = 3$, $\zeta_{T,0}(x)$ and $\zeta_C(x)$ are the densities of normal distributions $N(\mu_p 1_p, I_p)$ and $N(\mu_n 1_p, I_p)$, $\mu_p = 0.5$ and $\mu_n = 0$, $1_p = (1\ 1\ \cdots\ 1)^\top$, and $e_0(1)$ is the class prior set as $e_0(1) = 0.3$. By definition, the propensity score $e_0(d \mid x)$ is given as $e_0(1)\zeta_{T,0}(x)/\zeta_0(x)$. The outcome is generated similarly to the censoring setting $Y = X^\top \beta + 1.1 + \tau_0 D + \varepsilon$, where $\tau_0 = 3$.
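
The censoring-setting simulation and the fold split quoted in the table can be sketched in a few lines of NumPy. This is an illustrative reading of the quoted text, not the authors' code (which the table lists as unreleased). The function names `simulate_censoring` and `cross_fit_folds`, the `seed` arguments, and the fold count used in the example are hypothetical. Two further assumptions are made where the excerpt is ambiguous: the 0.5 in N(0, 0.5 I_p) is read as a variance, and c is drawn once per simulated dataset.

```python
import numpy as np


def simulate_censoring(n=3000, p=3, tau0=3.0, seed=0):
    """Draw one synthetic dataset following the quoted censoring-setting description."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                # X ~ N(0, I_p)
    # Assumption: 0.5 is the variance, so the standard deviation is sqrt(0.5).
    beta = rng.normal(0.0, np.sqrt(0.5), size=p)   # beta ~ N(0, 0.5 I_p)
    # P(D | X) = trunc(sigmoid(X^T beta), 0.1, 0.9)
    propensity = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 0.1, 0.9)
    D = rng.binomial(1, propensity)                # treatment indicator
    # Assumption: c ~ Uniform[0, 1] is drawn once per dataset.
    c = rng.uniform(0.0, 1.0)
    # O ~ Bernoulli(c) when D = 1, and O = 0 when D = 0.
    O = D * rng.binomial(1, c, size=n)
    # Y = X^T beta + 1.1 + tau0 * D + eps, with eps ~ N(0, 1)
    Y = X @ beta + 1.1 + tau0 * D + rng.standard_normal(n)
    return X, D, O, Y


def cross_fit_folds(n, n_folds, seed=0):
    """Randomly split {0, ..., n-1} into roughly equal folds.

    Returns a list of (train_idx, held_out_idx) pairs, where train_idx is the
    complement of the held-out fold, mirroring the index sets in Algorithm 1.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    all_idx = np.arange(n)
    return [(np.setdiff1d(all_idx, fold), np.sort(fold)) for fold in folds]


if __name__ == "__main__":
    X, D, O, Y = simulate_censoring()
    # The fold count is not stated in the excerpt; 2 is used only for illustration.
    for train_idx, held_out_idx in cross_fit_folds(len(Y), n_folds=2):
        print(len(train_idx), len(held_out_idx))
```

In a cross-fitted estimator, the nuisance models would be fit on `train_idx` and evaluated on `held_out_idx` for each fold; the estimator itself is omitted here because its formula is not part of the excerpts above.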