Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Valid Inference with Imperfect Synthetic Data
Authors: Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Childers, Bryan Wilder
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the finite-sample performance of our estimator across different tasks in computational social science applications, demonstrating large empirical gains. [...] Finally, in Section 5, we analyze the finite-sample performance of our estimator using real-world datasets that encompass varying computational social science tasks, demonstrating large empirical gains. [...] The results for our method's performance are shown in Figures 1 and 2. We will highlight some key observations. First, we observe that GMM-Synth achieves the lowest MSE, outperforming all baselines on 8 out of 8 downstream tasks. |
| Researcher Affiliation | Academia | Machine Learning Department, Carnegie Mellon University¹; University of Zurich² |
| Pseudocode | Yes | Algorithm 1: Cross-Fitting for PPI++Synth. 1: Labeled data $D = \{(T_i, X_i, Y_i)\}_{i=1}^{n}$; 2: Proxy data $\hat{D} = \{(T_j, \hat{X}_j, \hat{Y}_j)\}_{j=1}^{n+m}$; 3: Synthetic data $\tilde{D} = \{(\tilde{T}_j, \tilde{X}_j, \tilde{Y}_j)\}_{j=1}^{n+m}$; 4: $K$ folds. Ensure: Debiased estimate $\hat{\theta}_{\mathrm{CF}}$. 5: Split $D$ into folds $\{I_1, \dots, I_K\}$ |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Yes, code and sufficient instructions to reproduce results are provided in the Supplementary Material. All data used is publicly available. |
| Open Datasets | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Yes, code and sufficient instructions to reproduce results are provided in the Supplementary Material. All data used is publicly available. [...] Datasets. We validate the finite-sample performance of our estimator for logistic regression and ordinary least squares (OLS) regression on the following 4 computational social science tasks: First, we use online requests posted on Stack Exchange and Wikipedia [Danescu-Niculescu-Mizil et al., 2013b]... Third, we use a corpus of climate-related news headlines [Hmielowski et al., 2014]... Lastly, we use congressional bills texts [Adler and Wilkerson, 2011]... [...] Asset Licenses The assets used in our work are subject to the following licenses: Stack Exchange and Wikipedia Data [Danescu-Niculescu-Mizil et al., 2013a]: CC BY-NC-SA 3.0; Climate news headlines data [Luo et al., 2020]: CC BY-NC-SA 3.0; Congressional bills data [Adler and Wilkerson, 2011]: MIT license; |
| Dataset Splits | Yes | Algorithm 1: Cross-Fitting for PPI++Synth. 1: Labeled data $D = \{(T_i, X_i, Y_i)\}_{i=1}^{n}$; 2: Proxy data $\hat{D} = \{(T_j, \hat{X}_j, \hat{Y}_j)\}_{j=1}^{n+m}$; 3: Synthetic data $\tilde{D} = \{(\tilde{T}_j, \tilde{X}_j, \tilde{Y}_j)\}_{j=1}^{n+m}$; 4: $K$ folds. Ensure: Debiased estimate $\hat{\theta}_{\mathrm{CF}}$. 5: Split $D$ into folds $\{I_1, \dots, I_K\}$ |
| Hardware Specification | Yes | Compute Details Each experiment is run on an A6000 GPU. We evaluate and average performance over 200 random seeds for all experiments. |
| Software Dependencies | No | No specific software libraries with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | We use GPT-4o [Hurst et al., 2024] without any task-specific fine-tuning to generate both proxy and synthetic data. [...] We evaluate our method's performance against the adapted baselines discussed in Section 5.1 using four key metrics: empirical mean-squared error (MSE), coverage, confidence interval width, and effective sample size. [...] The results for our method's performance are shown in Figures 1 and 2. [...] Results are averaged over 200 trials. [...] In our experiments, [...] we choose to adopt a linear regression model (defined over a small number of covariates), which satisfies the required Donsker conditions to enable us to avoid any requirements on sample splitting as in standard DML approaches [Van Der Vaart and Wellner, Chernozhukov et al., 2018]. [...] the exception to this is on the Congressional Bills dataset, where we use XGBoost as linear regression performs very poorly in estimating the score function. [...] We present the full text prompts that were used to generate proxy covariates and labels (for the proxy data) and synthetic data. |
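
The Experiment Setup row names empirical MSE, coverage, and confidence interval width, averaged over 200 trials. The sketch below shows one way such per-trial summaries could be computed. It is an illustration under stated assumptions, not the authors' code: the function names `summarize_trials` and `run_toy_trials` are hypothetical, the toy task (estimating a Gaussian mean with a normal-approximation interval) stands in for the paper's estimators, and effective sample size is omitted because this page does not define it.

```python
import random
import statistics
from statistics import NormalDist


def summarize_trials(estimates, ci_lowers, ci_uppers, true_value):
    """Return empirical MSE, coverage, and mean CI width across trials."""
    n = len(estimates)
    # Mean squared error of the point estimates against the known target.
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    # Fraction of trials whose interval contains the target.
    coverage = sum(
        lo <= true_value <= hi for lo, hi in zip(ci_lowers, ci_uppers)
    ) / n
    # Average interval width.
    width = sum(hi - lo for lo, hi in zip(ci_lowers, ci_uppers)) / n
    return {"mse": mse, "coverage": coverage, "ci_width": width}


def run_toy_trials(num_trials=200, sample_size=100, true_mean=1.0, level=0.95):
    """Toy stand-in (an assumption): estimate a Gaussian mean, one seed per trial."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    estimates, lowers, uppers = [], [], []
    for seed in range(num_trials):
        rng = random.Random(seed)
        sample = [rng.gauss(true_mean, 1.0) for _ in range(sample_size)]
        est = statistics.fmean(sample)
        se = statistics.stdev(sample) / sample_size ** 0.5
        estimates.append(est)
        lowers.append(est - z * se)
        uppers.append(est + z * se)
    return summarize_trials(estimates, lowers, uppers, true_mean)


if __name__ == "__main__":
    print(run_toy_trials())
```

Seeding each trial separately mirrors the "200 random seeds" wording in the Hardware Specification row and keeps every trial individually reproducible.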