Can Large Language Models Infer Causation from Correlation?

Authors: Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona T. Diab, Bernhard Schölkopf

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task."
Researcher Affiliation | Collaboration | 1 Max Planck Institute for Intelligent Systems, Tübingen, Germany; 2 ETH Zürich; 3 LTI, CMU; 4 University of Hong Kong; 5 Meta AI; 6 University of Michigan
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 illustrates a data construction pipeline, but it is not pseudocode.
Open Source Code | Yes | "Our code is at https://github.com/causalNLP/corr2cause."
Open Datasets | Yes | "Our data is at https://huggingface.co/datasets/causalnlp/corr2cause."
Dataset Splits | Yes | "We split the data into 205,734 training samples, 1,076 development and 1,162 test samples."
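The reported split sizes imply a heavily train-weighted split, with under one percent of the data each for development and testing. A quick sketch that checks the totals (the split names here are illustrative, not necessarily the dataset's own keys):

```python
# Split sizes as reported in the paper.
splits = {"train": 205_734, "dev": 1_076, "test": 1_162}

# Total sample count and the fraction each split represents.
total = sum(splits.values())
fractions = {name: n / total for name, n in splits.items()}

print(total)  # overall sample count
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```

This confirms the "more than 200K samples" claim: the three splits sum to 207,972, with roughly 98.9% used for training.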
Hardware Specification | Yes | "We train the models on a server with an NVIDIA Tesla A100 GPU with 40G of memory."
Software Dependencies | No | The paper mentions using the "transformers library (Wolf et al., 2020)" and the OpenAI fine-tuning API, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "We use the validation set to tune the learning rate, which takes value in {2e-6, 5e-6, 1e-5, 2e-5, 5e-5}; dropout rate, which takes value in {0, 0.1, 0.2, 0.3}; and weight decay, which takes value in {1e-4, 1e-5}. We train the models until convergence, which is usually around ten epochs."
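The hyperparameter search described above is a plain grid over three value sets. A minimal sketch of enumerating that grid for validation-set tuning (the training loop itself is omitted; only the grid values come from the paper):

```python
from itertools import product

# Hyperparameter values reported in the experiment setup.
learning_rates = [2e-6, 5e-6, 1e-5, 2e-5, 5e-5]
dropout_rates = [0, 0.1, 0.2, 0.3]
weight_decays = [1e-4, 1e-5]

# Full Cartesian product: 5 * 4 * 2 = 40 candidate configurations.
grid = list(product(learning_rates, dropout_rates, weight_decays))
print(len(grid))

# In practice, each (lr, dropout, wd) tuple would be passed to a
# trainer, scored on the validation set, and the best config kept.
```

With 40 configurations and convergence around ten epochs each, this is a modest but non-trivial tuning budget per model.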