Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a diverse set of language modeling tasks, including symbol manipulation, knowledge retrieval, and instruction following, we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. We compare our proposed methods against existing methods across a diverse set of five tasks under in-distribution (ID) and multiple out-of-distribution (OOD) settings. Our strongest method, counterfactual simulation, improves average AUC-ROC by 13.84% over prior baselines. |
| Researcher Affiliation | Academia | 1Stanford University. Correspondence to: Jing Huang <EMAIL>. |
| Pseudocode | No | The paper describes methods and workflows using text and mathematical equations but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and code available at https://github.com/explanare/ood-prediction |
| Open Datasets | Yes | We consider a variety of tasks that cover both idealized cases where we know the task mechanisms and open-ended ones where only partial or approximate mechanisms are identified. These tasks fall into three categories: (1) symbol manipulation tasks, including Indirect Object Identification (IOI; Wang et al., 2023) and Price Tag (Wu et al., 2023), for which the internal mechanisms used to solve the task are clearly known; (2) knowledge retrieval tasks, including RAVEL (Huang et al., 2024) and MMLU (Hendrycks et al., 2021), whose mechanisms are only partially understood; (3) an instruction following task: Unlearn HP (Thaker et al., 2024). |
| Dataset Splits | Yes | For each task, we randomly sample 3 folds from the datasets and split them into train/val/test. For each split, we ensure the ratio of correct and wrong examples is 1:1, i.e., we are evaluating each correctness estimation method on a balanced binary classification task. For all tasks except the Unlearn HP, each fold has 2048/1024/1024 examples for train/val/test sets. For Unlearn HP, due to the limited number of sentences sampled from the original books, we use 1024/512/512 examples. |
| Hardware Specification | No | The paper mentions using models like Llama-3-8B-Instruct, Llama-2-13b-chat-hf, and Llama-3-70B-Instruct, and refers to "computational constraints" for the 70B model, but it does not specify the exact GPU models, CPU types, memory, or other hardware specifications used for running experiments. |
| Software Dependencies | No | The paper mentions the use of specific language models (e.g., Llama-3-8B-Instruct) and an optimizer (AdamW), but it does not specify version numbers for programming languages, libraries (e.g., PyTorch, TensorFlow), or other key software components used for implementation. |
| Experiment Setup | Yes | For counterfactual simulation, we use 10K pairs randomly sampled from 1024x1024 correct examples as the training data. We use the AdamW optimizer with a constant learning rate of 1e-4 and no weight decay, trained for one epoch. For each task, we list several consecutive intervention layers around the optimal layer with the highest AUC-ROC score, as variables are likely distributed across several layers. |
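The value-probing method summarized above (predicting output correctness from the values of internal causal variables, evaluated with AUC-ROC on a balanced 1:1 correct/wrong split) can be sketched as a linear probe over hidden activations. This is a minimal illustration, not the authors' released code: the activations below are synthetic stand-ins for model hidden states, and `train_probe` and `auc_roc` are hypothetical helper names.

```python
import numpy as np

def auc_roc(scores, labels):
    """AUC-ROC via the rank-sum (Mann-Whitney U) identity.

    Assumes continuous scores (ties are not handled specially).
    """
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def train_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe fit with plain batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(correct)
        g = p - y                               # gradient of log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Synthetic "hidden states": one direction encodes a causal variable
# that correlates with output correctness.
rng = np.random.default_rng(0)
d, n = 16, 512
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
y = np.repeat([1, 0], n // 2)                   # balanced 1:1 labels
X = rng.normal(size=(n, d)) + np.outer(2.0 * y - 1.0, direction)

w, b = train_probe(X, y)
print(f"probe AUC-ROC: {auc_roc(X @ w + b, y):.3f}")
```

On this toy data the probe recovers the encoding direction and scores well above chance (0.5); the same probe-then-AUC recipe applies when `X` holds real hidden activations at a chosen intervention layer.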