Can Large Language Models Infer Causation from Correlation?
Authors: Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona T. Diab, Bernhard Schölkopf
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. |
| Researcher Affiliation | Collaboration | ¹ Max Planck Institute for Intelligent Systems, Tübingen, Germany; ² ETH Zürich; ³ LTI, CMU; ⁴ University of Hong Kong; ⁵ Meta AI; ⁶ University of Michigan |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 illustrates a data construction pipeline, but it is not pseudocode. |
| Open Source Code | Yes | Our code is at https://github.com/causalNLP/corr2cause. |
| Open Datasets | Yes | Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. (A loading sketch follows the table.) |
| Dataset Splits | Yes | We split the data into 205,734 training samples, 1,076 development and 1,162 test samples. |
| Hardware Specification | Yes | We train the models on a server with an NVIDIA Tesla A100 GPU with 40 GB of memory. |
| Software Dependencies | No | The paper mentions using the "transformers library (Wolf et al., 2020)" and "OpenAI finetuning API" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the validation set to tune the learning rate, which takes value in {2e-6, 5e-6, 1e-5, 2e-5, 5e-5}; dropout rate, which takes value in {0, 0.1, 0.2, 0.3}; and weight decay, which takes value in {1e-4, 1e-5}. We train the models until convergence, which is usually around ten epochs. (A fine-tuning sketch follows the table.) |
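
The dataset row points to a Hugging Face Hub repository. Below is a minimal sketch of loading it with the `datasets` library; the repository ID comes from the link in the table, while the split names and column layout are assumptions rather than details confirmed by the quotes above.

```python
# Minimal sketch: loading the CORR2CAUSE dataset from the Hugging Face Hub
# with the `datasets` library. The repository ID comes from the link in the
# table; the split names and column layout are assumptions, not confirmed
# by the quotes above.
from datasets import load_dataset

dataset = load_dataset("causalnlp/corr2cause")

# Report how many examples each split exposes (the paper cites 205,734 /
# 1,076 / 1,162 for train / dev / test).
for split_name, split in dataset.items():
    print(split_name, len(split))

# Peek at one record; the task is framed as binary classification, so each
# record is expected to carry the correlational premise and causal
# hypothesis as text plus a 0/1 label.
print(dataset["train"][0])
```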
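
The experiment-setup row quotes a grid over learning rate, dropout rate, and weight decay, trained for roughly ten epochs with the transformers library. The sketch below shows one way such a sweep could be wired up with the Hugging Face `Trainer`; the base model (`roberta-base`), batch size, column names, and split names are illustrative assumptions, since the quote only specifies the search ranges and the training budget.

```python
# Sketch of the quoted hyperparameter grid, assuming a RoBERTa-style
# sequence classifier fine-tuned with Hugging Face transformers.
import itertools

from datasets import load_dataset
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

MODEL_NAME = "roberta-base"  # assumed base model for this sketch

# Search ranges quoted in the Experiment Setup row.
learning_rates = [2e-6, 5e-6, 1e-5, 2e-5, 5e-5]
dropout_rates = [0.0, 0.1, 0.2, 0.3]
weight_decays = [1e-4, 1e-5]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
raw = load_dataset("causalnlp/corr2cause")

def tokenize(batch):
    # "input" is an assumed text column name; adjust to the actual schema.
    return tokenizer(batch["input"], truncation=True)

tokenized = raw.map(tokenize, batched=True)

for lr, dropout, wd in itertools.product(learning_rates, dropout_rates, weight_decays):
    config = AutoConfig.from_pretrained(
        MODEL_NAME,
        num_labels=2,                       # binary entailment-style label
        hidden_dropout_prob=dropout,        # dropout on hidden layers
        attention_probs_dropout_prob=dropout,
    )
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)

    args = TrainingArguments(
        output_dir=f"corr2cause_lr{lr}_do{dropout}_wd{wd}",
        learning_rate=lr,
        weight_decay=wd,
        num_train_epochs=10,                # "until convergence ... around ten epochs"
        per_device_train_batch_size=16,     # assumed; not reported in the quote
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],  # assumed name of the dev split
        tokenizer=tokenizer,
    )
    trainer.train()
    print(lr, dropout, wd, trainer.evaluate())
```

In line with the quoted setup, the development split would be used to pick the best configuration from this grid, with the held-out test split reserved for the final evaluation.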