Can Large Language Models Infer Causation from Correlation?

Authors: Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona T. Diab, Bernhard Schölkopf

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task."
Researcher Affiliation | Collaboration | 1 Max Planck Institute for Intelligent Systems, Tübingen, Germany; 2 ETH Zürich; 3 LTI, CMU; 4 University of Hong Kong; 5 Meta AI; 6 University of Michigan
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 illustrates a data construction pipeline, but it is not pseudocode.
Open Source Code | Yes | "Our code is at https://github.com/causalNLP/corr2cause."
Open Datasets | Yes | "Our data is at https://huggingface.co/datasets/causalnlp/corr2cause."
Dataset Splits | Yes | "We split the data into 205,734 training samples, 1,076 development and 1,162 test samples."
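The reported split sizes imply a heavily train-weighted split, with under one percent of the data each for development and testing. A quick sketch that checks the totals (the split names here are illustrative, not necessarily the dataset's own keys):

```python
# Split sizes as reported in the paper.
splits = {"train": 205_734, "dev": 1_076, "test": 1_162}

# Total sample count and the fraction each split represents.
total = sum(splits.values())
fractions = {name: n / total for name, n in splits.items()}

print(total)  # overall sample count
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```

This confirms the "more than 200K samples" claim: the three splits sum to 207,972, with roughly 98.9% used for training.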
Hardware Specification | Yes | "We train the models on a server with an NVIDIA Tesla A100 GPU with 40G of memory."
Software Dependencies | No | The paper mentions using the "transformers library (Wolf et al., 2020)" and the OpenAI fine-tuning API, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "We use the validation set to tune the learning rate, which takes value in {2e-6, 5e-6, 1e-5, 2e-5, 5e-5}; dropout rate, which takes value in {0, 0.1, 0.2, 0.3}; and weight decay, which takes value in {1e-4, 1e-5}. We train the models until convergence, which is usually around ten epochs."
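The hyperparameter search described above is a plain grid over three value sets. A minimal sketch of enumerating that grid for validation-set tuning (the training loop itself is omitted; only the grid values come from the paper):

```python
from itertools import product

# Hyperparameter values reported in the experiment setup.
learning_rates = [2e-6, 5e-6, 1e-5, 2e-5, 5e-5]
dropout_rates = [0, 0.1, 0.2, 0.3]
weight_decays = [1e-4, 1e-5]

# Full Cartesian product: 5 * 4 * 2 = 40 candidate configurations.
grid = list(product(learning_rates, dropout_rates, weight_decays))
print(len(grid))

# In practice, each (lr, dropout, wd) tuple would be passed to a
# trainer, scored on the validation set, and the best config kept.
```

With 40 configurations and convergence around ten epochs each, this is a modest but non-trivial tuning budget per model.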