Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SimpleStrat: Diversifying Language Model Generation with Stratification
Authors: Justin Wong, Yury Orlovskiy, Alexander Shypula, Michael Luo, Sanjit A. Seshia, Joseph E. Gonzalez
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To measure resampling diversity, we introduce Coverage QA, a dataset of underspecified questions with multiple equally plausible answers. We propose measuring resampling diversity as the KL Divergence between the response distribution and the uniform distribution over valid ground truth answers and use recall as an alternative when assessing proprietary models. On Coverage QA, SimpleStrat improves diversity across all temperatures, showing orthogonal benefits. Quantifiably, we achieve as much as 4X better recall when applied to GPT-4o, and an average reduction in KL divergence by 0.36 when applied to Llama 3. Furthermore, we show that SimpleStrat achieves more resampling diversity at temperature T=0 than scaling temperature to T=1 on creative writing, an open-ended domain. |
| Researcher Affiliation | Academia | Justin Wong UC Berkeley EMAIL Yury Orlovskiy UC Berkeley EMAIL Alexander Shypula University of Pennsylvania EMAIL Michael Luo UC Berkeley EMAIL Sanjit A. Seshia UC Berkeley EMAIL Joseph E. Gonzalez UC Berkeley EMAIL |
| Pseudocode | No | The paper describes the SimpleStrat workflow in natural language and with flow diagrams in Figure 2. It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps. |
| Open Source Code | Yes | Implementation and dataset available at https://github.com/jwong8314/simplestrat . |
| Open Datasets | Yes | To measure resampling diversity, we introduce Coverage QA, a dataset of underspecified questions with multiple equally plausible answers. [...] Implementation and dataset available at https://github.com/jwong8314/simplestrat . |
| Dataset Splits | Yes | The dataset consists of two splits: Coverage QA-Curated, manually curated naturally underspecified questions, and Coverage QA-Wikipedia an auto-generated dataset of underspecified questions. |
| Hardware Specification | Yes | The inference of these models were run on 8 A100-80GB GPUs. |
| Software Dependencies | No | The paper mentions specific LLM models (gpt-4o-2024-08-06, claude-3.5-sonnet-20240620, Llama 3 and 3.1 families) and one library (pyspellcheck) but does not provide version numbers for general software dependencies like programming languages (e.g., Python), frameworks (e.g., PyTorch), or the pyspellcheck library itself. The LLM model IDs are not considered ancillary software dependencies in this context. |
| Experiment Setup | Yes | We compare the coverage diversity (recall) of SimpleStrat, GPT-4o, and Claude 3.5 Sonnet as a function of temperature. We sweep over temperatures from 0 to 1.5. [...] Table 1: Performance of Different Prompting Strategies across Temperature Settings (GPT-4o). Columns: Temp.; GPT-4o (std); SimpleStrat (std); 20Q Abl. (std); Single Prompt Abl. (std). Temp. 0: 0.0646 (0.0011); 0.2423 (0.0050) [...] Temp. 1.5: 0.2676 (0.0059); 0.4634 (0.0085); 0.3304 (0.0104) |
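
The two diversity measures quoted in the Research Type row (KL divergence from the uniform distribution over valid answers, and recall) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `coverage_recall` and `kl_to_uniform` are hypothetical. The direction KL(P || U), exact string matching, and dropping responses that match no valid answer are all assumptions made here for illustration.

```python
import math
from collections import Counter


def coverage_recall(responses, valid_answers):
    """Fraction of valid answers that appear at least once among the responses."""
    valid = set(valid_answers)
    if not valid:
        raise ValueError("valid_answers must be non-empty")
    return len(valid & set(responses)) / len(valid)


def kl_to_uniform(responses, valid_answers):
    """KL(P || U) in nats.

    P is the empirical distribution of the responses restricted to the valid
    answers (assumption: invalid responses are dropped and P is renormalized).
    U is the uniform distribution over the valid answers.
    """
    valid = set(valid_answers)
    counts = Counter(r for r in responses if r in valid)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no response matched a valid answer")
    uniform = 1.0 / len(valid)
    # Answers that never appear contribute 0 (the 0 * log 0 = 0 convention),
    # so only observed answers are summed.
    return sum(
        (c / total) * math.log((c / total) / uniform) for c in counts.values()
    )


if __name__ == "__main__":
    valid = ["a", "b", "c", "d"]
    samples = ["a", "a", "b", "a"]
    print(coverage_recall(samples, valid))
    print(kl_to_uniform(samples, valid))
```

Under these assumptions, a model that spreads its samples evenly over the valid answers gets a KL divergence of 0, and one that always returns the same valid answer gets log(n) for n valid answers. Recall needs only the sampled responses, which is consistent with the quoted text using it as the alternative for proprietary models.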