Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?
Authors: Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs' causal reasoning processes. Experiments demonstrate that G2-Reasoner significantly enhances LLMs' causal reasoning capability, particularly in fresh and counterfactual contexts. (A hedged prompt-assembly sketch follows the table.) |
| Researcher Affiliation | Academia | Haoang Chi (1,2), He Li (2), Wenjing Yang (2), Feng Liu (3), Long Lan (2), Xiaoguang Ren (1), Tongliang Liu (4), Bo Han (5). (1) Intelligent Game and Decision Lab, (2) National University of Defense Technology, (3) University of Melbourne, (4) University of Sydney, (5) Hong Kong Baptist University |
| Pseudocode | Yes | Algorithm 1 Question assignment algorithm |
| Open Source Code | Yes | The CausalProbe-2024 benchmark and the source code are available at this URL: https://github.com/Haoang97/CausalProbe-2024 |
| Open Datasets | Yes | G2-Reasoner leverages a small (~16 MB) general knowledge Q&A dataset as the knowledge base, enabling the model to draw upon related knowledge... General knowledge dataset: https://huggingface.co/datasets/MuskumPillerum/General-Knowledge |
| Dataset Splits | No | The paper describes the Causal Probe 2024 benchmark and its usage for evaluation but does not specify explicit train/validation/test dataset splits for their own experiments. |
| Hardware Specification | Yes | All the experiments are conducted on the Ubuntu 20.04 system and NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions software such as the Faiss package and Meta's Contriever but does not provide specific version numbers for these libraries. (A hedged retrieval sketch follows the table.) |
| Experiment Setup | Yes | For LLM inference, we set the temperature to 1.0 for closed-source LLMs at all times, and to 0 for open-source LLMs at all times. For CoT reasoning, we set the maximum number of new tokens to 128, and to 50 in all other cases. (A sketch of these decoding settings follows the table.) |
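The paper reports a RAG-style knowledge base built from the general-knowledge dataset above, retrieved with Meta's Contriever and indexed with the Faiss package (no versions given). The following is a minimal sketch of such a retrieval step, not the authors' released code: the column name `Answer`, the mean-pooling recipe, and the helper names `embed`/`retrieve` are illustrative assumptions.

```python
# Sketch of Contriever + Faiss retrieval over the general-knowledge Q&A
# dataset the paper cites. Batched encoding is omitted for brevity.
import faiss
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    """Mean-pool Contriever's last hidden states into one vector per text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.numpy().astype("float32")

# The ~16 MB general-knowledge Q&A dataset used as the knowledge base.
kb = load_dataset("MuskumPillerum/General-Knowledge", split="train")
passages = [row["Answer"] for row in kb]  # column name is an assumption

index = faiss.IndexFlatIP(768)  # Contriever hidden size; inner-product search
index.add(embed(passages))

def retrieve(question: str, k: int = 3) -> list[str]:
    """Return the k knowledge snippets closest to the question embedding."""
    _, ids = index.search(embed([question]), k)
    return [passages[i] for i in ids[0]]
```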
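G2-Reasoner combines the retrieved general knowledge with a goal-oriented prompt. The template below is a hypothetical illustration of that assembly; the paper does not reproduce its exact prompt wording here, so every string in this sketch is an assumption.

```python
# Hypothetical G2-Reasoner-style prompt: general knowledge plus a
# goal-oriented instruction, prepended to the causal question.
# The wording is illustrative, not the authors' actual template.
def g2_prompt(question: str, knowledge: list[str]) -> str:
    knowledge_block = "\n".join(f"- {snippet}" for snippet in knowledge)
    return (
        "General knowledge that may help:\n"
        f"{knowledge_block}\n\n"
        "Goal: identify the genuine cause-effect relationship in the "
        "question, rather than relying on surface-level associations.\n\n"
        f"Question: {question}\nAnswer:"
    )
```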
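The reported decoding settings are simple enough to capture in a few lines. This is a reconstruction of the quoted hyperparameters, not the authors' script; the function name `decoding_config` is ours.

```python
# Reported settings: temperature 1.0 for closed-source LLMs, 0 for
# open-source LLMs; 128 max new tokens for CoT, 50 in all other cases.
def decoding_config(closed_source: bool, cot: bool) -> dict:
    return {
        "temperature": 1.0 if closed_source else 0.0,
        "max_new_tokens": 128 if cot else 50,
    }

# Example: an open-source LLM answering with CoT reasoning.
assert decoding_config(closed_source=False, cot=True) == {
    "temperature": 0.0,
    "max_new_tokens": 128,
}
```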