Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Language Models Can Predict Their Own Behavior

Authors: Dhananjay Ashok, Jonathan May

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide evidence that there are times when we can predict how an LM will behave early in computation, before even a single token is generated. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. Using methods from conformal prediction, we provide provable bounds on the estimation error of our probes, creating precise early warning systems for these behaviors. ... We train linear classifiers (probes) [4] that use an LM's internal representation of input tokens to predict the eventual behavior of its output. We then calibrate the probes using methods from conformal prediction [81]. ... The results (Figure 4, in blue) show that the method is highly effective at reducing the inference cost with minimal cost to accuracy.
Researcher Affiliation	Academia	Dhananjay Ashok, Information Sciences Institute, University of Southern California; Jonathan May, Information Sciences Institute, University of Southern California
Pseudocode	No	The paper describes its methodology verbally and through high-level diagrams (e.g., Figure 1: Overview of our method) rather than explicit pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is accessible at: https://github.com/DhananjayAshok/LMBehaviorEstimation and We have provided an anonymous link to our code: https://anonymous.4open.science/r/LMBehaviorEstimation/.
Open Datasets	Yes	We collect three QA datasets Natural QA [54], MSMarco [65] and Trivia QA [47]... We used the following datasets in our experiments, all usage is in accordance with their respective licenses. ... ARC: The AI2 Reasoning Challenge (ARC) [21] is a knowledge and reasoning challenge...
Dataset Splits	Yes	We evaluate whether the output format has been followed, and then sample this data to obtain training and testing splits that have an equal number of failures and successes. and Specifically, we use a held-out validation set Dvalid to calibrate the probe after training. and For each dataset, we select a maximum of 50,000 training instances to train (and validate) our probes, using the full test set to measure all metrics.
Hardware Specification	Yes	All of our experiments were run on a compute cluster with 8 NVIDIA A40 GPUs (approx 46068 MiB of memory) on CUDA version 12.6. The CPU on the cluster is an AMD EPYC 7502 32-Core Processor. Most experiments could be conducted with less than 16GB of GPU RAM.
Software Dependencies	No	The paper mentions "CUDA version 12.6" in the hardware section. While it names the models used (Llama3.1-8B, bert-large-uncased), it does not list specific versions of Python, PyTorch, or other libraries. All of our experiments were run on a compute cluster with 8 NVIDIA A40 GPUs (approx 46068 MiB of memory) on CUDA version 12.6.
Experiment Setup	Yes	We train linear classifiers (probes) [4] that use an LM's internal representation of input tokens to predict the eventual behavior of its output. We then calibrate the probes using methods from conformal prediction [81]. ... Unless specified otherwise, we set α = 0.9. ... We collect the output of the middle Transformer layer of Llama3.1-8B [26] when processing the final token of the input... We also fine-tune a bert-large-uncased [22] model for text classification... All LM inference uses greedy decoding and is hence deterministic.
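
Illustrative sketch (not taken from the paper or its repository): the Experiment Setup row describes training a linear probe and then calibrating it on a held-out validation set with conformal prediction. The Python below shows one common way such a pipeline can look, using split conformal prediction with the score 1 - p(true label). Everything in it is an assumption made for illustration: the synthetic features stand in for LM hidden states, the function names are hypothetical, and the target coverage of 0.9 is chosen only to echo the quoted α = 0.9, whose exact role the excerpt does not define.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_proba(X, w, b):
    """Return an (n, 2) array of class probabilities [P(y=0), P(y=1)]."""
    p1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return np.column_stack([1.0 - p1, p1])


def conformal_threshold(probs_valid, y_valid, coverage=0.9):
    """Split-conformal threshold from a held-out calibration set."""
    n = len(y_valid)
    # Nonconformity score: one minus the probability of the true label.
    scores = 1.0 - probs_valid[np.arange(n), y_valid]
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * coverage) / n)
    return np.quantile(scores, level, method="higher")


def prediction_sets(probs, qhat):
    """Boolean (n, 2) mask: True where a label is kept in the set."""
    return (1.0 - probs) <= qhat


# Synthetic stand-in for hidden-state features (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 16))
true_w = rng.normal(size=16)
y = (X @ true_w + rng.normal(size=3000) > 0).astype(int)

X_train, y_train = X[:1000], y[:1000]
X_valid, y_valid = X[1000:2000], y[1000:2000]
X_test, y_test = X[2000:], y[2000:]

w, b = train_linear_probe(X_train, y_train)
qhat = conformal_threshold(predict_proba(X_valid, w, b), y_valid, coverage=0.9)
sets = prediction_sets(predict_proba(X_test, w, b), qhat)

# Fraction of test points whose true label lies inside the prediction set.
coverage_test = sets[np.arange(len(y_test)), y_test].mean()
print(f"threshold={qhat:.3f}, empirical coverage={coverage_test:.3f}")
```

With exchangeable calibration and test data, the empirical coverage should land near the requested level; a prediction set that contains both labels signals that the probe is not confident enough to commit to either behavior.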