Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?

Authors: Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen by the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs' causal reasoning processes. Experiments demonstrate that G2-Reasoner significantly enhances LLMs' causal reasoning capability, particularly in fresh and counterfactual contexts.
Researcher Affiliation | Academia | Haoang Chi (1,2), He Li (2), Wenjing Yang (2), Feng Liu (3), Long Lan (2), Xiaoguang Ren (1), Tongliang Liu (4), Bo Han (5). Affiliations: 1 Intelligent Game and Decision Lab; 2 National University of Defense Technology; 3 University of Melbourne; 4 University of Sydney; 5 Hong Kong Baptist University.
Pseudocode | Yes | Algorithm 1: Question assignment algorithm.
Open Source Code | Yes | The CausalProbe-2024 benchmark and the source code are available at this URL: https://github.com/Haoang97/CausalProbe-2024.
Open Datasets | Yes | G2-Reasoner leverages a small (~16 MB) general knowledge Q&A dataset as the knowledge base, enabling the model to draw upon related knowledge... General knowledge dataset: https://huggingface.co/datasets/MuskumPillerum/General-Knowledge.
Dataset Splits | No | The paper describes the CausalProbe-2024 benchmark and its use in evaluation but does not specify explicit train/validation/test splits for the authors' own experiments.
Hardware Specification | Yes | All the experiments are conducted on the Ubuntu 20.04 system and NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions software such as the Faiss package and Meta's Contriever but does not provide specific version numbers for these libraries.
Experiment Setup | Yes | For LLM inference, we always set the temperature to 1.0 for closed-source LLMs and to 0 for open-source LLMs. For CoT reasoning, we set the maximum number of new tokens to 128, and to 50 in all other cases.
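The decoding settings reported in the Experiment Setup row can be sketched as a small helper. This is a minimal illustration only: the function name and dictionary layout are ours, not from the paper.

```python
def decoding_settings(closed_source: bool, use_cot: bool) -> dict:
    """Return generation hyperparameters matching the reported setup.

    - temperature: 1.0 for closed-source LLMs, 0 for open-source LLMs
    - max_new_tokens: 128 for CoT reasoning, 50 in all other cases
    """
    return {
        "temperature": 1.0 if closed_source else 0.0,
        "max_new_tokens": 128 if use_cot else 50,
    }

# Example: an open-source LLM queried with chain-of-thought prompting
settings = decoding_settings(closed_source=False, use_cot=True)
print(settings)  # {'temperature': 0.0, 'max_new_tokens': 128}
```

The returned dictionary maps directly onto the keyword arguments most inference APIs accept for temperature and generation length.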
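The G2-Reasoner pipeline described above (retrieve general knowledge, then prompt with an explicit goal) can be sketched as follows. Note the substitutions: the paper retrieves from the ~16 MB general-knowledge dataset with Meta's Contriever embeddings and Faiss, whereas this dependency-free sketch uses plain word overlap; all function names and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def retrieve_knowledge(question: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Rank knowledge snippets by word overlap with the question.

    Stand-in for the paper's Contriever + Faiss retrieval step.
    """
    q_words = Counter(question.lower().split())

    def overlap(snippet: str) -> int:
        return sum((q_words & Counter(snippet.lower().split())).values())

    return sorted(knowledge_base, key=overlap, reverse=True)[:k]

def build_g2_prompt(question: str, knowledge_base: list[str]) -> str:
    """Assemble a G2-Reasoner-style prompt: retrieved general knowledge,
    a goal-oriented instruction, then the causal question."""
    knowledge = "\n".join(retrieve_knowledge(question, knowledge_base))
    goal = "Goal: identify the true cause-effect relation, not a spurious association."
    return f"General knowledge:\n{knowledge}\n{goal}\nQuestion: {question}"

# Toy knowledge base standing in for the general-knowledge Q&A dataset
kb = [
    "Rain makes road surfaces wet and slippery.",
    "Ice cream sales rise in summer.",
    "Wet roads increase braking distance.",
]
prompt = build_g2_prompt("Why do wet roads cause more accidents?", kb)
```

The assembled `prompt` would then be sent to the LLM with the decoding settings from the Experiment Setup row.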