Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

Authors: Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implemented two metrics based on counterfactual simulatability: precision and generality. We generated diverse counterfactuals automatically using LLMs. We then used these metrics to evaluate state-of-the-art LLMs on two tasks: multi-hop factual reasoning and reward modeling. (A minimal precision/generality sketch follows the table.)
Researcher Affiliation | Academia | Columbia University, UC Berkeley, NYU Shanghai, New York University.
Pseudocode | No | The paper describes its methods and evaluation pipeline in narrative form and with figures, but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/yandachen/CounterfactualSimulatability.
Open Datasets | Yes | We evaluate explanations on multi-hop reasoning (StrategyQA) and reward modeling (Stanford Human Preference). StrategyQA is a multi-hop question-answering dataset on open-domain questions (Geva et al., 2021). Stanford Human Preference (SHP) is a human preference dataset over agent responses to users' questions and instructions (Bai et al., 2022).
Dataset Splits | No | The paper evaluates pre-trained large language models (GPT-3.5 and GPT-4) and does not describe a training process for its own methodology with explicit training/validation/test dataset splits. The datasets mentioned (StrategyQA, SHP) are used for evaluating the explanations generated by these pre-trained models.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments, only mentioning support from OpenAI for GPT-4 inference.
Software Dependencies | No | The paper mentions using GPT-3.5 and GPT-4, but it does not provide specific version numbers for any software libraries or dependencies, such as Python, PyTorch, or other relevant packages.
Experiment Setup | Yes | We generate ten counterfactuals per explanation for StrategyQA and six for SHP. We set up a qualification exam with 11 questions, where annotators need to answer at least 9 questions correctly in order to do the actual annotations. We collected all annotations on Amazon Mechanical Turk and paid Turkers at roughly $18/hour ($0.6/HIT).
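
To make the two metrics named under Research Type concrete, here is a minimal sketch of how simulation precision and generality could be computed over a set of LLM-generated counterfactuals. The Counterfactual data structure, the function names, and the embedding-based diversity measure are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional

import numpy as np


@dataclass
class Counterfactual:
    """One LLM-generated counterfactual input (hypothetical data structure)."""
    question: str
    model_answer: str                        # what the model actually outputs on this input
    simulated_answer: Optional[str]          # what a simulator infers from the explanation; None = not simulatable
    embedding: Optional[np.ndarray] = None   # sentence embedding of the question, used for diversity


def simulation_precision(cfs: list[Counterfactual]) -> float:
    """Fraction of simulatable counterfactuals on which the simulator's
    inferred answer matches the model's actual answer."""
    simulatable = [c for c in cfs if c.simulated_answer is not None]
    if not simulatable:
        return 0.0
    hits = sum(c.simulated_answer == c.model_answer for c in simulatable)
    return hits / len(simulatable)


def simulation_generality(cfs: list[Counterfactual]) -> float:
    """Diversity of the simulatable counterfactuals, approximated here as
    one minus the mean pairwise cosine similarity of their embeddings
    (the exact diversity measure is an assumption of this sketch)."""
    embs = [c.embedding for c in cfs
            if c.simulated_answer is not None and c.embedding is not None]
    if len(embs) < 2:
        return 0.0
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in combinations(embs, 2)]
    return 1.0 - float(np.mean(sims))
```

Per the Experiment Setup row, each explanation's `cfs` list would hold ten counterfactuals for StrategyQA and six for SHP.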