Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Robust Sampling for Active Statistical Inference

Authors: Puheng Li, Tijana Zrnic, Emmanuel Candes

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We turn to evaluating the performance of our robust sampling approach empirically. Each of the following subsections is dedicated to a different experiment using social science research data. Section 4.1 measures presidential approval, Section 4.2 analyzes US age income patterns, and Section 4.3 applies language models to score text on social attributes such as political bias. On each of these datasets, we use the following methods to collect labels: (1) uniform sampling, which essentially recovers prediction-powered inference [1]; (2) standard uncertainty-based active sampling [51]; and (3) our robust active method as per Algorithm 1.
Researcher Affiliation Academia Puheng Li, Department of Statistics, Stanford University, Stanford, CA 94305; Tijana Zrnic, Department of Statistics and Stanford Data Science, Stanford University, Stanford, CA 94305; Emmanuel J. Candes, Department of Statistics and Department of Mathematics, Stanford University, Stanford, CA 94305
Pseudocode Yes Algorithm 1: Robust Active Inference
Open Source Code Yes The source code for all experiments is available at https://github.com/lph-Leo/Robust-Active-Statistical-Inference.
Open Datasets Yes Following [51], we evaluate the different methods on survey data collected by the Pew Research Center following the 2020 United States presidential election, aiming at gauging people's approval of the presidential candidates' political messaging [32]. We study the annual American Community Survey (ACS) Public Use Microdata Sample (PUMS) collected by the US Census Bureau [12]. In the first task, the goal is to study the political leaning of media articles, using the data curated by Baly et al. [4]. The next task is to estimate how certain linguistic devices impact the perceived politeness of online requests. We use the dataset of requests from Wikipedia and Stack Exchange curated by Danescu-Niculescu-Mizil et al. [11]. Finally, we study the prevalence of misinformation in news headlines, using the dataset collected by Gabrel et al. [15].
Dataset Splits No The paper mentions a "burn-in period" for estimating the error function and resamples 500 times for coverage estimation, but does not provide conventional train/test/validation splits for model training or evaluation of the active sampling methods in the main text.
Hardware Specification No The paper states that the algorithm is "computationally efficient" but does not provide specific hardware details (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies No The paper mentions using a "multilayer perceptron (MLP)", an "XGBoost model [9]", and "GPT-4o annotations", but it does not specify software dependencies with version numbers (e.g., Python, specific libraries like PyTorch or Scikit-learn versions).
Experiment Setup Yes We set the target coverage level to be 0.9 throughout. We resample 500 times to estimate the coverage. We use a multilayer perceptron (MLP) as our predictive model f. To implement active inference, we use π(x) ∝ min{f(x), 1 − f(x)}, in which f(x) is the predicted probability that the label takes on the value 1, as considered in [51]. We use an XGBoost model [9] to predict income Y from available demographic covariates. For robust active inference, we use the geometric path and robust optimization with an ℓ2 constraint set C, as before.
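
The setup quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustration only, not code from the paper's repository, and it covers the standard uncertainty-based baseline, not the robust method of Algorithm 1. The function names, the synthetic data, the labeling budget, the clipping thresholds, and the inverse-probability-weighted mean estimator are all assumptions made for this example; only the sampling rule proportional to min{f(x), 1 − f(x)}, the 0.9 target coverage, and the 500 resamples come from the quoted text.

```python
import numpy as np
from statistics import NormalDist


def sampling_probs(f, budget):
    """Uncertainty-based rule: pi(x) proportional to min{f(x), 1 - f(x)}.

    Scaled so that the average labeling probability is about `budget`
    (the budget and the clipping floor are assumptions of this sketch).
    """
    u = np.minimum(f, 1.0 - f)
    pi = budget * u / u.mean()
    return np.clip(pi, 1e-3, 1.0)


def active_mean_ci(y, f, pi, rng, level=0.9):
    """Confidence interval for E[Y] from actively sampled labels.

    Uses an inverse-probability-weighted correction of the predictions
    (assumed estimator form): f + (y - f) * xi / pi, with xi ~ Bernoulli(pi).
    Only labels with xi == 1 influence the result.
    """
    xi = rng.random(len(f)) < pi
    terms = f + (y - f) * xi / pi
    est = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return est - z * se, est + z * se


def estimate_coverage(n=2000, budget=0.2, trials=500, level=0.9, seed=0):
    """Fraction of resamples whose interval contains the true mean (0.5 here)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        # Synthetic data: p is the true P(Y = 1 | X), so E[Y] = 0.5.
        p = rng.uniform(0.1, 0.9, size=n)
        y = (rng.random(n) < p).astype(float)
        # Imperfect predictor, clipped away from 0 and 1 to bound the weights.
        f = np.clip(p + rng.normal(0.0, 0.1, size=n), 0.05, 0.95)
        pi = sampling_probs(f, budget)
        lo, hi = active_mean_ci(y, f, pi, rng, level)
        hits += int(lo <= 0.5 <= hi)
    return hits / trials


if __name__ == "__main__":
    print("empirical coverage:", estimate_coverage())
```

Under these assumptions, the empirical coverage should land close to the 0.9 target. Points where the predictor is confident receive small labeling probabilities and therefore large weights 1/π(x) when they are sampled, which is why the sketch clips both f and π.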