Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SimpleStrat: Diversifying Language Model Generation with Stratification

Authors: Justin Wong, Yury Orlovskiy, Alexander Shypula, Michael Luo, Sanjit A. Seshia, Joseph E. Gonzalez

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To measure resampling diversity, we introduce Coverage QA, a dataset of underspecified questions with multiple equally plausible answers. We propose measuring resampling diversity as the KL Divergence between the response distribution and the uniform distribution over valid ground truth answers and use recall as an alternative when assessing proprietary models. On Coverage QA, SimpleStrat improves diversity across all temperatures, showing orthogonal benefits. Quantifiably, we achieve as much as 4X better recall when applied to GPT-4o, and an average reduction in KL divergence by 0.36 when applied to Llama 3. Furthermore, we show that SimpleStrat achieves more resampling diversity at temperature T=0 than scaling temperature to T=1 on creative writing, an open-ended domain.
Researcher Affiliation	Academia	Justin Wong UC Berkeley EMAIL Yury Orlovskiy UC Berkeley EMAIL Alexander Shypula University of Pennsylvania EMAIL Michael Luo UC Berkeley EMAIL Sanjit A. Seshia UC Berkeley EMAIL Joseph E. Gonzalez UC Berkeley EMAIL
Pseudocode	No	The paper describes the SimpleStrat workflow in natural language and with the flow diagrams in Figure 2. It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code	Yes	Implementation and dataset available at https://github.com/jwong8314/simplestrat.
Open Datasets	Yes	To measure resampling diversity, we introduce Coverage QA, a dataset of underspecified questions with multiple equally plausible answers. [...] Implementation and dataset available at https://github.com/jwong8314/simplestrat.
Dataset Splits	Yes	The dataset consists of two splits: Coverage QA-Curated, manually curated naturally underspecified questions, and Coverage QA-Wikipedia, an auto-generated dataset of underspecified questions.
Hardware Specification	Yes	Inference for these models was run on 8 A100-80GB GPUs.
Software Dependencies	No	The paper mentions specific LLM models (gpt-4o-2024-08-06, claude-3.5-sonnet-20240620, Llama 3 and 3.1 families) and one library (pyspellcheck) but does not provide version numbers for general software dependencies like programming languages (e.g., Python), frameworks (e.g., PyTorch), or the pyspellcheck library itself. The LLM model IDs are not considered ancillary software dependencies in this context.
Experiment Setup	Yes	We compare the coverage diversity (recall) of SimpleStrat, GPT-4o, and Claude 3.5 Sonnet as a function of temperature. We sweep over temperatures from 0 to 1.5. [...] Table 1: Performance of Different Prompting Strategies across Temperature Settings (GPT-4o). Columns: Temp. | GPT-4o (std) | SimpleStrat (std) | 20Q Abl. (std) | Single Prompt Abl. (std). Rows: 0 | 0.0646 (0.0011) | 0.2423 (0.0050) [...] 1.5 | 0.2676 (0.0059) | 0.4634 (0.0085) | 0.3304 (0.0104)
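
The two diversity metrics quoted in the Research Type row (KL divergence from the uniform distribution over valid answers, and recall) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the direction of the divergence (response distribution relative to uniform) and the choice to drop responses outside the valid answer set before normalizing are assumptions made for illustration only.

```python
import math
from collections import Counter


def coverage_recall(samples, valid_answers):
    """Fraction of valid answers that appear at least once among the samples."""
    valid = set(valid_answers)
    if not valid:
        raise ValueError("valid_answers must be non-empty")
    return len(valid & set(samples)) / len(valid)


def kl_to_uniform(samples, valid_answers):
    """KL(P || U) in nats, where P is the empirical distribution of sampled
    responses and U is uniform over the valid answers.

    Assumption (not from the paper): responses outside the valid set are
    dropped and P is renormalized over the remaining responses.
    """
    valid = set(valid_answers)
    counts = Counter(s for s in samples if s in valid)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sampled response matches a valid answer")
    n = len(valid)
    kl = 0.0
    for count in counts.values():
        p = count / total
        # Uniform mass is 1/n, so p * log(p / (1/n)) = p * log(p * n).
        # Answers never sampled contribute 0 (the 0 * log 0 = 0 convention).
        kl += p * math.log(p * n)
    return kl


if __name__ == "__main__":
    valid = ["red", "blue", "green", "yellow"]
    samples = ["red", "red", "blue", "green"]
    print(coverage_recall(samples, valid))
    print(kl_to_uniform(samples, valid))
```

Under these assumptions, a model that always returns the same valid answer reaches the maximum divergence of log(n) for n valid answers, while a model whose responses are spread evenly over all valid answers reaches 0. Lower KL divergence and higher recall both indicate more diverse resampling.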