Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Predicting the Performance of Black-box Language Models with Follow-up Queries

Authors: Dylan Sam, Marc Finzi, Zico Kolter

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that querying a model with follow-up questions yields features that are highly predictive of performance on LLM benchmarks. We show that simple linear models trained on these features accurately predict instance-level correctness on question-answering and reasoning tasks. Surprisingly, our black-box approach often matches or even outperforms white-box methods that operate over the language model's hidden state, across a range of different language models and benchmarks. Furthermore, we demonstrate that our predictors admit nice generalization properties due to their low-dimensional nature and perform well on out-of-distribution data (e.g., transferred to new model scales or new datasets) due to our approach's generality.
Researcher Affiliation | Academia | Dylan Sam (Carnegie Mellon University), Marc Finzi (Carnegie Mellon University), J. Zico Kolter (Carnegie Mellon University)
Pseudocode | No | The paper describes methods in narrative text and includes theoretical propositions and proofs in the appendix, but it does not present any structured pseudocode or algorithm blocks for its methodology.
Open Source Code | Yes | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Code is provided in the supplement.
Open Datasets | Yes | We consider the open-ended QA benchmarks NQ [Kwiatkowski et al., 2019] and SQuAD [Rajpurkar et al., 2016], as well as the closed-ended QA datasets of BoolQ [Clark et al., 2019], WinoGrande [Sakaguchi et al., 2021], HaluEval [Li et al., 2023], DHate [Vidgen et al., 2021], and CS QA [Talmor et al., 2019]. These datasets encompass commonsense reasoning, hallucination detection, factual recall, and toxicity classification. Finally, we also evaluate on math (GSM8K [Cobbe et al., 2021]) and code (CodeContests [Li et al., 2022]) benchmarks to evaluate if our approach is predictive of tasks that require reasoning.
Dataset Splits | Yes | For all datasets, we truncate the number of training examples to the first 5000 instances from each dataset's original train split (if they are longer than 5000 examples). We take the first 1000 instances from each test split to construct our test dataset. For the experiments with the LLaMA3-70B and GPT models, we use 1000 instances for the training datasets due to computational costs.
Hardware Specification | Yes | Our largest experiments are with LLaMA3-70B, which are run on a single node with 4 NVIDIA RTX A6000 GPUs. The other experiments are run with 2 RTX A6000 GPUs.
Software Dependencies | No | To train our downstream logistic regression models, we use the default settings from scikit-learn, with the default (L2) regularization. While scikit-learn is mentioned, no specific version number is provided for it or any other key software component.
Experiment Setup | Yes | To train our downstream logistic regression models, we use the default settings from scikit-learn, with the default (L2) regularization. We balance the logistic regression objective due to the unbalanced nature of the task (e.g., models are mostly incorrect on very challenging tasks). In all of the text generation tasks, we sample greedily from the LLM for its answer. For evaluating model performance on Natural Questions (NQ) [Kwiatkowski et al., 2019], we measure if the LLM has outputted one of the valid answers to the question. As mentioned previously, we use GPT-4o as an LLM judge to assess performance on CodeContests and on GSM8k. For all MCQ tasks, we use the standard set of answers of ("True", "False") or ("A", "B", "C", "D", "E") when they are the existing formatting in the dataset.
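The dataset-split and experiment-setup entries above describe a simple recipe: keep the first 5000 training and first 1000 test instances, then fit a scikit-learn logistic regression with its default L2 regularization and a balanced objective. The sketch below illustrates that recipe under stated assumptions and is not the authors' code. The helper names, the synthetic feature matrix, and the use of `class_weight="balanced"` as the balancing mechanism are all assumptions; the quoted text does not say which scikit-learn option was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

MAX_TRAIN = 5000  # cap on training instances, as quoted in the table
MAX_TEST = 1000   # number of test instances, as quoted in the table


def truncate_split(features, labels, limit):
    """Keep only the first `limit` instances of a split."""
    return features[:limit], labels[:limit]


def fit_correctness_predictor(train_x, train_y):
    """Fit a logistic regression with scikit-learn defaults (L2 penalty).

    class_weight="balanced" is an assumption about how the objective
    is balanced; the quoted setup does not name the exact option.
    """
    model = LogisticRegression(class_weight="balanced")
    model.fit(train_x, train_y)
    return model


# Hypothetical stand-in data: each row plays the role of the features
# extracted from follow-up queries; labels mark whether the LLM was correct.
rng = np.random.default_rng(0)
n_features = 8
true_weights = rng.normal(size=n_features)


def make_split(n_instances):
    x = rng.normal(size=(n_instances, n_features))
    # Shift the logit down so "incorrect" (0) is the majority class.
    noisy_logit = x @ true_weights - 2.0 + rng.logistic(size=n_instances)
    y = (noisy_logit > 0).astype(int)
    return x, y


train_x, train_y = truncate_split(*make_split(6000), MAX_TRAIN)
test_x, test_y = truncate_split(*make_split(1500), MAX_TEST)

predictor = fit_correctness_predictor(train_x, train_y)
test_accuracy = predictor.score(test_x, test_y)
print(f"held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

The printed accuracy reflects only the synthetic data generated here and says nothing about the paper's results. In a real replication, the feature rows would come from the model's responses to follow-up questions, and the labels from the per-benchmark correctness checks described in the table.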