Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?
Authors: Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs' causal reasoning processes. Experiments demonstrate that G2-Reasoner significantly enhances LLMs' causal reasoning capability, particularly in fresh and counterfactual contexts. |
| Researcher Affiliation | Academia | Haoang Chi1,2, He Li2, Wenjing Yang2, Feng Liu3, Long Lan2, Xiaoguang Ren1, Tongliang Liu4, Bo Han5; 1 Intelligent Game and Decision Lab, 2 National University of Defense Technology, 3 University of Melbourne, 4 University of Sydney, 5 Hong Kong Baptist University |
| Pseudocode | Yes | Algorithm 1 Question assignment algorithm |
| Open Source Code | Yes | The CausalProbe-2024 benchmark and the source codes are presented in this URL: https://github.com/Haoang97/CausalProbe-2024. |
| Open Datasets | Yes | G2-Reasoner leverages a small (16 Mb) general knowledge Q&A dataset [8] as the knowledge base, enabling the model to draw upon related knowledge... [8] General knowledge dataset: https://huggingface.co/datasets/MuskumPillerum/General-Knowledge. |
| Dataset Splits | No | The paper describes the CausalProbe-2024 benchmark and its use for evaluation but does not specify explicit train/validation/test splits for its own experiments. |
| Hardware Specification | Yes | All the experiments are conducted on the Ubuntu 20.04 system and NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like 'Faiss package' and 'Meta's Contriever' but does not provide specific version numbers for these libraries. |
| Experiment Setup | Yes | For LLM inference, we set the temperature parameter as 1.0 for closed-source LLMs all the time, and set it as 0 for open-source LLMs all the time. For CoT reasoning, we set the maximal new tokens as 128, and we set it as 50 for all other cases. |
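
The decoding settings quoted in the Experiment Setup row can be summarized as a small lookup. The following is a minimal sketch, not the authors' code: the names `DecodingConfig` and `decoding_config` are hypothetical, and only the four numeric values (temperature 1.0 or 0, and 128 or 50 new tokens) come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical container for the two settings the paper reports."""
    temperature: float
    max_new_tokens: int


def decoding_config(closed_source: bool, chain_of_thought: bool) -> DecodingConfig:
    """Return the reported decoding settings for one evaluation run.

    Assumption: the two choices are independent, i.e. the temperature depends
    only on whether the model is closed-source and the token budget depends
    only on whether chain-of-thought (CoT) prompting is used.
    """
    # Reported: temperature 1.0 for closed-source LLMs, 0 for open-source LLMs.
    temperature = 1.0 if closed_source else 0.0
    # Reported: 128 new tokens for CoT reasoning, 50 in all other cases.
    max_new_tokens = 128 if chain_of_thought else 50
    return DecodingConfig(temperature=temperature, max_new_tokens=max_new_tokens)
```

For example, `decoding_config(closed_source=False, chain_of_thought=True)` would describe an open-source model evaluated with CoT prompting. How these values are passed to a particular inference API is not stated in the quoted text and is left out of the sketch.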