Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Handling Missing Responses under Cluster Dependence with Applications to Language Model Evaluation

Authors: Zhenghao Zeng, David Arbour, Avi Feller, Ishita Dasgupta, Atanu Sinha, Edward H. Kennedy

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We further illustrate our findings through simulations and a real-world conversation quality dataset. Our theoretical and empirical results underscore the importance of incorporating cluster dependence in missing response problems to perform valid statistical inference.
Researcher Affiliation	Collaboration	Zhenghao Zeng (Stanford University); David Arbour (Adobe Research); Avi Feller (University of California, Berkeley); Ishita Dasgupta (Adobe Research); Atanu R Sinha (Adobe Research); Edward H. Kennedy (Carnegie Mellon University)
Pseudocode	No	The paper provides theoretical frameworks, mathematical derivations, and descriptions of methodologies in paragraph form. However, it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The codes to preprocess and analyze the data are provided in the supplementary materials.
Open Datasets	Yes	Our analysis focuses on the Open Assistant Conversations dataset (Köpf et al., 2023), a publicly available human-generated and human-annotated assistant-style conversation corpus. The dataset is structured as conversation trees, where each tree begins with an initial prompt message (root node) that can have multiple child messages as replies, which in turn can have their own responses.
Dataset Splits	Yes	Sample splitting is used, and all nuisance functions are estimated from half of the sample by the Super Learner (Polley et al., 2024) incorporating a generalized linear model and random forest.
Hardware Specification	Yes	The bootstrap procedure takes approximately 20 hours per outcome on a 12-core CPU machine.
Software Dependencies	Yes	Sample splitting is used, and all nuisance functions are estimated from half of the sample by the Super Learner (Polley et al., 2024) incorporating a generalized linear model and random forest.
Experiment Setup	Yes	C.1 Homogeneous Sampling: Consider the following data-generating process: For each cluster $g$, the cluster-level covariate $X_g \sim N(0, 1)$. Then the individual-level covariates $W_g \sim N(\mathbf{1}_{n_g} X_g, \sigma^2 \Sigma)$ given $X_g$, where $\Sigma_{ij} = \rho^{|i-j|}$ for $\rho = 0.8$, $\sigma^2 = 4$. For each individual $i$, the missing indicator $R_{gi}$ is sampled from a Bernoulli distribution with mean $\pi(X_g, W_{gi}) = \mathrm{logistic}(X_g + 0.5 W_{gi})$ and the outcome $Y_{gi}$ is sampled from $N(X_g + W_{gi} + 0.5, 1)$. ... In each replication of experiment, we generate the data with total sample size $n = 10000$ and cluster size $n_g = n^{\alpha}$ for $\alpha \in \{0.1, 0.15, \ldots, 0.5\}$, compute the doubly robust estimator $\hat{\theta}_{DR}$ and construct Wald-confidence intervals based on $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$. We then repeat the process $M = 500$ times and estimate the coverage probability of the 95% confidence intervals obtained. C.2 Sequential Sampling: Consider the following data-generating process: For each cluster $g$, the cluster-level covariate $X_g \sim N(0, 1)$. The individual-level covariates are generated sequentially from an AR(2) process. Specifically, $W_{gt} = A_1 W_{g,t-1} + A_2 W_{g,t-2} + \epsilon_t$, $\epsilon_t \sim N(0, 4 I_2)$. ... For each time $t$, the missing indicator $R_{gt}$ is sampled from a Bernoulli distribution with mean $\pi(X_g, S_{gt}) = \mathrm{logistic}(X_g + (1, 0.8, 0.5, 0.3)\, S_{gt})$ and the outcome $Y_{gi}$ is sampled from $N(X_g + (1, 1, 0.5, 0.4)\, S_{gt} + 1, 1)$. ... In each replication of the experiment, we generate the data with total sample size $n \in \{2000, 4000, \ldots, 16000\}$ and cluster size $n_g = n^{0.4}$, compute two doubly robust estimators $\hat{\psi}_{DR}$ adjusting for different information and evaluate the estimation error. We then repeat the process $M = 500$ times and estimate the Rooted-Mean-Squared-Error (RMSE).
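
For readers who want to see the shape of the C.1 homogeneous-sampling design quoted above, here is a minimal NumPy sketch of one simulated dataset. It is not the authors' code. The function name `simulate_homogeneous`, the rounding of the cluster size, and the choice to drop any remainder so that all clusters have equal size are assumptions made for illustration. The coefficient signs follow the text exactly as extracted here and may differ from the paper if symbols were lost in extraction. The excerpt also does not say whether R = 1 marks an observed or a missing response, so the sketch only draws the indicator.

```python
import numpy as np

def simulate_homogeneous(n=10_000, alpha=0.3, rho=0.8, sigma2=4.0, seed=0):
    """Draw one dataset from the C.1 design as quoted (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Cluster size n_g = n^alpha, rounded to an integer (assumption).
    n_g = max(1, int(round(n ** alpha)))
    # Number of equal-sized clusters; any remainder is dropped (assumption).
    n_clusters = n // n_g

    # Within-cluster covariance sigma^2 * Sigma with Sigma_ij = rho^|i-j|.
    idx = np.arange(n_g)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(sigma2 * Sigma)

    # Cluster-level covariate X_g ~ N(0, 1).
    X = rng.standard_normal(n_clusters)

    # Individual-level covariates W_g ~ N(1 * X_g, sigma^2 * Sigma).
    W = X[:, None] + rng.standard_normal((n_clusters, n_g)) @ chol.T

    # Indicator R_gi ~ Bernoulli(logistic(X_g + 0.5 * W_gi)).
    pi = 1.0 / (1.0 + np.exp(-(X[:, None] + 0.5 * W)))
    R = rng.binomial(1, pi)

    # Outcome Y_gi ~ N(X_g + W_gi + 0.5, 1).
    Y = rng.normal(X[:, None] + W + 0.5, 1.0)

    return X, W, R, Y

if __name__ == "__main__":
    X, W, R, Y = simulate_homogeneous()
    print(W.shape, R.mean())
```

Because W, R, and Y within a cluster all depend on the shared X_g, and W is additionally correlated through Σ, observations within a cluster are dependent. That dependence is what the coverage experiment in the quoted setup is designed to probe.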