Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a diverse set of language modeling tasks, including symbol manipulation, knowledge retrieval, and instruction following, we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. We compare our proposed methods against existing methods across a diverse set of five tasks under in-distribution (ID) and multiple out-of-distribution (OOD) settings. Our strongest method, counterfactual simulation, improves average AUC-ROC by 13.84% over prior baselines.
Researcher Affiliation | Academia | Stanford University. Correspondence to: Jing Huang <EMAIL>.
Pseudocode | No | The paper describes methods and workflows using text and mathematical equations but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code available at https://github.com/explanare/ood-prediction
Open Datasets | Yes | We consider a variety of tasks that cover both idealized cases where we know the task mechanisms and open-ended ones where only partial or approximate mechanisms are identified. These tasks fall into three categories: (1) symbol manipulation tasks, including Indirect Object Identification (IOI; Wang et al., 2023) and Price Tag (Wu et al., 2023), for which the internal mechanisms used to solve the task are clearly known; (2) knowledge retrieval tasks, including RAVEL (Huang et al., 2024) and MMLU (Hendrycks et al., 2021), whose mechanisms are only partially understood; (3) an instruction following task: Unlearn HP (Thaker et al., 2024).
Dataset Splits | Yes | For each task, we randomly sample 3 folds from the datasets and split them into train/val/test. For each split, we ensure the ratio of correct and wrong examples is 1:1, i.e., we are evaluating each correctness estimation method on a balanced binary classification task. For all tasks except Unlearn HP, each fold has 2048/1024/1024 examples for the train/val/test sets. For Unlearn HP, due to the limited number of sentences sampled from the original books, we use 1024/512/512 examples.
Hardware Specification | No | The paper mentions using models like Llama-3-8B-Instruct, Llama-2-13b-chat-hf, and Llama-3-70B-Instruct, and refers to "computational constraints" for the 70B model, but it does not specify the exact GPU models, CPU types, memory, or other hardware specifications used for running experiments.
Software Dependencies | No | The paper mentions the use of specific language models (e.g., Llama-3-8B-Instruct) and an optimizer (AdamW), but it does not specify version numbers for programming languages, libraries (e.g., PyTorch, TensorFlow), or other key software components used for implementation.
Experiment Setup | Yes | For counterfactual simulation, we use 10K pairs randomly sampled from 1024 × 1024 correct examples as the training data. We use the AdamW optimizer with a constant learning rate of 1e-4 and no weight decay, trained for one epoch. For each task, we list several consecutive intervention layers around the optimal layer with the highest AUC-ROC score, as variables are likely distributed across several layers.
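The counterfactual-simulation idea quoted in the Research Type row can be illustrated with a toy interchange-intervention check. Everything below (the toy model, its hidden "sum" variable, and the function names) is an illustrative assumption for exposition, not the authors' implementation:

```python
def toy_model(inputs, patched_var=None):
    """A toy 'model' whose output depends on one hidden causal variable.

    The hidden variable is the sum of the inputs; the output is that sum
    modulo 10. `patched_var` overwrites the variable, mimicking an
    activation patch at the layer where the variable is realized.
    """
    var = sum(inputs) if patched_var is None else patched_var
    return var % 10


def counterfactual_simulation(model, extract_var, base, source):
    """Check whether a hypothesized causal variable drives the output.

    Read the variable off the `source` input, patch it into the run on
    `base`, and test whether the patched output matches what the high-level
    causal model predicts for `source`. Agreement is evidence the variable
    is faithfully realized in the model.
    """
    patched_output = model(base, patched_var=extract_var(source))
    expected_output = model(source)  # high-level causal model's prediction
    return patched_output == expected_output
```

With the correct hypothesis (`extract_var=sum`) the check passes for any base/source pair; a wrong hypothesis such as `extract_var=max` fails on inputs where max and sum diverge, which is the signal the paper's method exploits.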
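The split protocol in the Dataset Splits row (3 random folds, a 1:1 ratio of correct to wrong examples, 2048/1024/1024 per fold) can be sketched as below. The function name, arguments, and sampling details are illustrative assumptions, not the released code:

```python
import random

def balanced_splits(correct, wrong, sizes=(2048, 1024, 1024), seed=0):
    """Sample one fold of balanced train/val/test splits.

    `correct` and `wrong` are lists of examples the model answered
    correctly / incorrectly. Each split keeps a 1:1 ratio of the two
    classes, matching the balanced binary-classification setup quoted
    above (use sizes=(1024, 512, 512) for the Unlearn HP task).
    """
    rng = random.Random(seed)
    n_per_class = sum(sizes) // 2  # examples needed from each class
    pos = rng.sample(correct, n_per_class)
    neg = rng.sample(wrong, n_per_class)
    splits, start = {}, 0
    for name, size in zip(("train", "val", "test"), sizes):
        half = size // 2
        # equal halves from each class, shuffled within the split
        chunk = pos[start:start + half] + neg[start:start + half]
        rng.shuffle(chunk)
        splits[name] = chunk
        start += half
    return splits
```

Repeating this with three different seeds would give the three folds described in the paper.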
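The optimizer setting in the Experiment Setup row (AdamW, constant learning rate 1e-4, no weight decay) corresponds to the following update rule. This is a minimal NumPy sketch of a single AdamW step for reference; the authors presumably use a standard library implementation such as `torch.optim.AdamW` rather than anything like this:

```python
import numpy as np

def adamw_step(w, grad, state, lr=1e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One AdamW update (Loshchilov & Hutter style decoupled decay).

    `state` is (m, v, t): first/second moment estimates and step count.
    With weight_decay=0.0, as in the quoted setup, this reduces to Adam.
    """
    m, v, t = state
    t += 1
    m = betas[0] * m + (1 - betas[0]) * grad          # first moment
    v = betas[1] * v + (1 - betas[1]) * grad ** 2     # second moment
    m_hat = m / (1 - betas[0] ** t)                   # bias correction
    v_hat = v / (1 - betas[1] ** t)
    # decoupled weight decay: applied to w directly, not via the gradient
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, (m, v, t)
```

At lr=1e-4 each parameter moves by roughly 1e-4 per step early in training, which is why the paper can train the correctness predictor in a single epoch over the 10K sampled pairs.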