COLD: Causal reasOning in cLosed Daily activities
Authors: Abhinav Joshi, Areeb Ahmad, Ashutosh Modi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities trivial to humans. We further explore the causal reasoning abilities of LLMs using the backdoor criterion to determine the causal strength between events. |
| Researcher Affiliation | Academia | Abhinav Joshi Areeb Ahmad Ashutosh Modi Department of Computer Science and Engineering Indian Institute of Technology Kanpur (IIT Kanpur) Kanpur, India {ajoshi,areeb,ashutoshm}@cse.iitk.ac.in |
| Pseudocode | Yes | App. B, Algorithm 1 presents the mechanism to create a dataset of causal query triplets. We start by constructing the set of possible triplets, and sort the nodes in every triplet using the topological order present in the observational graph (a DAG). (A minimal sketch of this step appears after the table.) |
| Open Source Code | Yes | We release the framework, model code, and results via https://github.com/Exploration-Lab/COLD. |
| Open Datasets | Yes | We use a script corpus called DeScript [Wanzare et al., 2016] for creating the observational graphs. |
| Dataset Splits | No | The paper evaluates pre-trained models on a created dataset but does not define training or validation splits for its own model training, as it does not train new models. |
| Hardware Specification | Yes | We perform all the experiments using a machine with 5 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper lists the specific pre-trained language models used (e.g., gpt-neo-125M, Llama-2-7b-chat-hf), but does not provide specific version numbers for general software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch). |
| Experiment Setup | No | The paper describes its prompt-based evaluation scheme for pre-trained language models but does not provide specific hyperparameters or training configurations, as it does not train any new models. (A sketch of such a prompt-based scoring scheme appears after the table.) |
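
To make the triplet-creation step described in App. B, Algorithm 1 concrete, here is a minimal Python sketch, not the authors' released code: it assumes the observational graph is a `networkx` DAG, uses a hypothetical toy activity in place of a real DeScript script, enumerates 3-node combinations, and sorts each triplet by the DAG's topological order.

```python
# Minimal sketch of the triplet-construction step from Algorithm 1 (not the
# authors' exact code): enumerate candidate node triplets from an
# observational DAG and sort each by topological order.
from itertools import combinations
import networkx as nx

# Hypothetical observational graph for a daily activity ("baking a cake");
# real graphs in the paper are built from the DeScript corpus.
G = nx.DiGraph()
G.add_edges_from([
    ("gather ingredients", "mix batter"),
    ("preheat oven", "bake cake"),
    ("mix batter", "bake cake"),
    ("bake cake", "eat cake"),
])

# Topological order assigns each event node a position index in the DAG.
topo_index = {node: i for i, node in enumerate(nx.topological_sort(G))}

# Build all 3-node combinations and sort each triplet by topological order,
# so causal precedence inside every triplet matches the graph.
triplets = [
    tuple(sorted(nodes, key=topo_index.get))
    for nodes in combinations(G.nodes, 3)
]

for t in triplets[:3]:
    print(t)
```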
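Likewise, here is a hedged sketch of what a prompt-based evaluation of a causal query might look like, assuming Hugging Face `transformers` and the `gpt-neo-125M` checkpoint the paper lists among its models; the prompt wording, query, and scoring rule are illustrative assumptions, and the paper's actual prompts may differ. Each answer choice is scored by its summed token log-likelihood under the model.

```python
# Hedged sketch of a prompt-based causal-query evaluation (not the authors'
# exact prompts or scoring): rank answer choices by log-likelihood under a
# pre-trained causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # one of the models listed in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical causal query; real queries come from the COLD benchmark.
premise = "The person mixed the batter."
choices = ["The person baked the cake.", "The person preheated the oven."]
prompt = f"Event: {premise} What happens as an effect? Answer: "

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of token log-probabilities of `choice` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs of each token given its preceding context.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(len(targets)), targets]
    # Count only the tokens belonging to the answer choice (BPE boundary
    # effects at the prompt/choice seam are ignored in this sketch).
    return token_lp[prompt_ids.shape[1] - 1 :].sum().item()

scores = [choice_logprob(prompt, c) for c in choices]
print("Predicted effect:", choices[scores.index(max(scores))])
```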