Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
Authors: Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implemented two metrics based on counterfactual simulatability: precision and generality. We generated diverse counterfactuals automatically using LLMs. We then used these metrics to evaluate state-of-the-art LLMs on two tasks: multi-hop factual reasoning and reward modeling. |
| Researcher Affiliation | Academia | Columbia University, UC Berkeley, NYU Shanghai, New York University. |
| Pseudocode | No | The paper describes its methods and evaluation pipeline in narrative form and with figures, but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/yandachen/CounterfactualSimulatability. |
| Open Datasets | Yes | We evaluate explanations on multi-hop reasoning (StrategyQA) and reward modeling (Stanford Human Preference). StrategyQA is a multi-hop question-answering dataset on open-domain questions (Geva et al., 2021). Stanford Human Preference (SHP) is a human preference dataset over agent responses to users' questions and instructions (Ethayarajh et al., 2022). |
| Dataset Splits | No | The paper evaluates pre-trained large language models (GPT-3.5 and GPT-4) and does not train models of its own, so it reports no explicit training/validation/test splits. The datasets mentioned (StrategyQA, SHP) are used only to evaluate the explanations generated by these pre-trained models. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments, only mentioning support from OpenAI for GPT-4 inference. |
| Software Dependencies | No | The paper mentions using GPT-3.5 and GPT-4, but it does not provide specific version numbers for any software libraries or dependencies, such as Python, PyTorch, or other relevant packages. |
| Experiment Setup | Yes | We generate ten counterfactuals per explanation for StrategyQA and six for SHP. We set up a qualification exam with 11 questions; annotators need to answer at least 9 correctly in order to do the actual annotations. We collected all annotations on Amazon Mechanical Turk and paid Turkers roughly $18/hour ($0.6/HIT). |
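The Research Type row above summarizes the paper's two counterfactual-simulatability metrics, precision and generality, computed over LLM-generated counterfactuals. The sketch below is a minimal illustration of how such metrics could be computed; the data structure, function names, and the embedding-based generality formula are assumptions made for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of counterfactual-simulatability metrics (precision and generality).
# The Counterfactual record, the None-means-unsure convention, and the diversity
# formula below are assumptions for illustration only.
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional

import numpy as np


@dataclass
class Counterfactual:
    question: str
    model_answer: str                # the model's actual answer on this counterfactual
    simulator_guess: Optional[str]   # answer a simulator infers from the explanation; None if unsure
    embedding: np.ndarray            # vector representation used to measure diversity


def simulation_precision(cfs: List[Counterfactual]) -> float:
    """Fraction of simulatable counterfactuals (where the simulator can make a guess)
    on which the guess matches the model's actual answer."""
    simulatable = [c for c in cfs if c.simulator_guess is not None]
    if not simulatable:
        return 0.0
    agree = sum(c.simulator_guess == c.model_answer for c in simulatable)
    return agree / len(simulatable)


def simulation_generality(cfs: List[Counterfactual]) -> float:
    """A diversity proxy over simulatable counterfactuals: one minus the mean pairwise
    cosine similarity of their embeddings (an assumed formulation, not the paper's exact one)."""
    simulatable = [c for c in cfs if c.simulator_guess is not None]
    if len(simulatable) < 2:
        return 0.0
    sims = []
    for a, b in combinations(simulatable, 2):
        cos = float(np.dot(a.embedding, b.embedding) /
                    (np.linalg.norm(a.embedding) * np.linalg.norm(b.embedding)))
        sims.append(cos)
    return 1.0 - float(np.mean(sims))
```

Under this sketch, higher precision means the explanation lets a simulator correctly predict the model's behavior on counterfactual inputs, and higher generality means those counterfactuals span more diverse inputs.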