COLD: Causal reasOning in cLosed Daily activities

Authors: Abhinav Joshi, Areeb Ahmad, Ashutosh Modi

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities trivial to humans. We further explore the causal reasoning abilities of LLMs using the backdoor criterion to determine the causal strength between events. |
| Researcher Affiliation | Academia | Abhinav Joshi, Areeb Ahmad, Ashutosh Modi; Department of Computer Science and Engineering, Indian Institute of Technology Kanpur (IIT Kanpur), Kanpur, India. {ajoshi,areeb,ashutoshm}@cse.iitk.ac.in |
| Pseudocode | Yes | App. B, Algorithm 1 presents the mechanism to create a dataset of causal query triplets. We start by constructing the set of possible triplets, and sort the nodes in every triplet using the topological order present in the observational graph (a DAG). |
| Open Source Code | Yes | We release the framework, model code, and results via https://github.com/Exploration-Lab/COLD. |
| Open Datasets | Yes | We use a script corpus called DeScript [Wanzare et al., 2016] for creating the observational graphs. |
| Dataset Splits | No | The paper evaluates pre-trained models on a created dataset but does not define training or validation splits, as it does not train new models. |
| Hardware Specification | Yes | We perform all the experiments using a machine with 5 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper lists the specific pre-trained language models used (e.g., gpt-neo-125M, Llama-2-7b-chat-hf), but does not provide version numbers for general software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch). |
| Experiment Setup | No | The paper describes its prompt-based evaluation scheme for pre-trained language models but does not provide hyperparameters or training configurations, as it does not train any new models. |
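The triplet-construction mechanism quoted above (App. B, Algorithm 1) can be sketched in a few lines: enumerate candidate triplets of events, then order each triplet by the topological order of the observational graph. This is a minimal illustration, not the paper's implementation; the toy activity graph and event names below are invented for the example.

```python
from itertools import combinations
from graphlib import TopologicalSorter

# Hypothetical observational graph (a DAG) for one daily activity:
# an edge src -> dst means event src precedes event dst.
# The events are illustrative only, not from the COLD dataset.
edges = {
    "boil water": ["pour water"],
    "add coffee grounds": ["pour water"],
    "pour water": ["stir"],
    "stir": [],
}

# graphlib expects {node: predecessors}, so invert the successor map.
preds = {node: set() for node in edges}
for src, dsts in edges.items():
    for dst in dsts:
        preds[dst].add(src)

# One topological order over all events in the activity.
topo = list(TopologicalSorter(preds).static_order())
rank = {event: i for i, event in enumerate(topo)}

# Candidate causal-query triplets: every 3-event subset, with the
# events inside each triplet sorted by their topological rank.
triplets = [
    tuple(sorted(trip, key=rank.__getitem__))
    for trip in combinations(topo, 3)
]
```

Sorting within each triplet by topological rank guarantees that earlier events in the activity's observational graph appear first, which is what makes the triplet usable as a directed causal query.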