CLadder: Assessing Causal Reasoning in Language Models

Authors: Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schölkopf

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CAUSALCOT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs."
Researcher Affiliation | Academia | "Zhijing Jin (1,2), Yuen Chen (1), Felix Leeb (1), Luigi Gresele (1), Ojasv Kamal (3), Zhiheng Lyu (4), Kevin Blin (2), Fernando Gonzalez (2), Max Kleiman-Weiner (5), Mrinmaya Sachan (2), Bernhard Schölkopf (1); (1) MPI for Intelligent Systems, Tübingen; (2) ETH Zürich; (3) IIT Kharagpur; (4) University of Hong Kong; (5) University of Washington"
Pseudocode | Yes | "Correct steps to lead to the ground-truth answer: ..." (Figure 1: Example question in our CLADDER dataset featuring an instance of Simpson's paradox [63]. We generate the following (symbolic) triple: (i) the causal query; (ii) the ground-truth answer, derived through a causal inference engine [66]; and (iii) a step-by-step explanation. We then verbalize these questions by turning them into stories, inspired by examples from the causality literature, which can be expressed in natural language.) "Our Causal Chain-of-Thought (CausalCoT) Model: Guidance: Address the question by following the steps below: ..." (Figure 4: Illustration of our CAUSALCOT prompting strategy, which designs a chain of subquestions inspired by the idea of a CI engine [66].)
Open Source Code | Yes | "Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder."
Open Datasets | Yes | "We compose a large dataset, CLADDER, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder." (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper reports statistics for its CLADDER dataset but does not explicitly describe how the dataset is split into training, validation, and test sets for experimental purposes.
Hardware Specification | No | The paper mentions using the OpenAI API for proprietary models and running open-source models (LLaMA, Alpaca) but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the OpenAI API and the Language Tool package, but it does not provide specific version numbers for these or any other software dependencies needed to reproduce the experiments.
Experiment Setup | Yes | "Given a causal question q, we provide the LLM a list of instructions ℓ := (s_1, . . . , s_6) consisting of the detailed descriptions of the six steps s_1, . . . , s_6 in Figure 4. As the model f_LLM : s_i ↦ r_i autoregressively produces responses r_1, . . . , r_6 sequentially corresponding to the six steps, we concatenate all the above before asking the final question 'Based on all the reasoning above, output one word to answer the initial question with just Yes or No.' See the complete prompt in Appendix B.1. In the end, we obtain the binary answer a ∈ {Yes, No} as the final result." (A prompting-loop sketch follows the table.)
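
As a quick check of the open-data claim, the released dataset can be pulled straight from the Hugging Face Hub. Below is a minimal sketch using the `datasets` library; the repository name comes from the paper, but the split and field names are whatever the Hub exposes and are not specified in the text above.

```python
# Minimal sketch: load the open-sourced CLadder dataset from the Hugging Face Hub.
# Requires: pip install datasets
from datasets import load_dataset

# Repository name taken from the paper; splits/fields below are not specified there.
ds = load_dataset("causalNLP/cladder")

print(ds)  # inspect available splits and their features

# Peek at one example from the first available split:
first_split = list(ds.keys())[0]
example = next(iter(ds[first_split]))
print(example)  # one symbolic question verbalized as a natural-language story
```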
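
The experiment-setup description maps naturally onto a sequential prompting loop. The following is a minimal sketch assuming an OpenAI-style chat API; the `STEPS` strings are placeholder paraphrases of the six-step guidance, not the paper's verbatim prompt (see Appendix B.1 of the paper for the real one).

```python
# Minimal sketch of the CausalCoT-style sequential prompting described above.
# Assumes the OpenAI Python client; STEPS are placeholder paraphrases, not the
# paper's verbatim six-step guidance (Appendix B.1).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STEPS = [
    "Step 1: Extract the causal graph implied by the question.",
    "Step 2: Identify the query type (associational, interventional, counterfactual).",
    "Step 3: Formalize the query symbolically.",
    "Step 4: Gather the available data (probabilities) stated in the question.",
    "Step 5: Derive the estimand for the query.",
    "Step 6: Compute the estimand using the gathered data.",
]

def causal_cot(question: str, model: str = "gpt-4") -> str:
    """Ask the six sub-questions sequentially, then elicit a one-word answer."""
    messages = [{"role": "user", "content": question}]
    for step in STEPS:
        messages.append({"role": "user", "content": step})
        reply = client.chat.completions.create(model=model, messages=messages)
        r_i = reply.choices[0].message.content
        # Keep each intermediate response r_i in the history, mirroring the
        # concatenation described in the setup.
        messages.append({"role": "assistant", "content": r_i})
    messages.append({
        "role": "user",
        "content": "Based on all the reasoning above, output one word to "
                   "answer the initial question with just 'Yes' or 'No'.",
    })
    final = client.chat.completions.create(model=model, messages=messages)
    return final.choices[0].message.content.strip()
```

Because every intermediate response stays in the message history, the final one-word question conditions on the full reasoning chain, which is the essential property of the setup quoted above.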