COLD: Causal reasOning in cLosed Daily activities
Authors: Abhinav Joshi, Areeb Ahmad, Ashutosh Modi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities trivial to humans. We further explore the causal reasoning abilities of LLMs using the backdoor criterion to determine the causal strength between events. |
| Researcher Affiliation | Academia | Abhinav Joshi Areeb Ahmad Ashutosh Modi Department of Computer Science and Engineering Indian Institute of Technology Kanpur (IIT Kanpur) Kanpur, India {ajoshi,areeb,ashutoshm}@cse.iitk.ac.in |
| Pseudocode | Yes | App. B, Algorithm 1 presents the mechanism to create a dataset of causal query triplets. We start by constructing the set of possible triplets, and sort the nodes in every triplet using the topological order present in the observational graph (a DAG). (A minimal sketch of this step appears after the table.) |
| Open Source Code | Yes | We release the framework, model code, and results via https://github.com/Exploration-Lab/COLD. |
| Open Datasets | Yes | We use a script corpus called DeScript [Wanzare et al., 2016] for creating the observational graphs. |
| Dataset Splits | No | The paper evaluates pre-trained models on a created dataset but does not define training or validation splits for its own model training, as it does not train new models. |
| Hardware Specification | Yes | We perform all the experiments using a machine with 5 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper lists the specific pre-trained language models used (e.g., gpt-neo-125M, Llama-2-7b-chat-hf), but does not provide specific version numbers for general software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch). |
| Experiment Setup | No | The paper describes its prompt-based evaluation scheme for pre-trained language models but does not provide specific hyperparameters or training configurations, as it does not train any new models. (A sketch of such a prompt-based scoring scheme appears after the table.) |
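
To make the triplet-creation step described in App. B, Algorithm 1 concrete, here is a minimal Python sketch, not the authors' released code: it assumes the observational graph is a `networkx` DAG, uses a hypothetical toy activity in place of a real DeScript script, enumerates 3-node combinations, and sorts each triplet by the DAG's topological order.

```python
# Minimal sketch of the triplet-construction step from Algorithm 1 (not the
# authors' exact code): enumerate candidate node triplets from an
# observational DAG and sort each by topological order.
from itertools import combinations
import networkx as nx

# Hypothetical observational graph for a daily activity ("baking a cake");
# real graphs in the paper are built from the DeScript corpus.
G = nx.DiGraph()
G.add_edges_from([
    ("gather ingredients", "mix batter"),
    ("preheat oven", "bake cake"),
    ("mix batter", "bake cake"),
    ("bake cake", "eat cake"),
])

# Topological order assigns each event node a position index in the DAG.
topo_index = {node: i for i, node in enumerate(nx.topological_sort(G))}

# Build all 3-node combinations and sort each triplet by topological order,
# so causal precedence inside every triplet matches the graph.
triplets = [
    tuple(sorted(nodes, key=topo_index.get))
    for nodes in combinations(G.nodes, 3)
]

for t in triplets[:3]:
    print(t)
```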
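Likewise, here is a hedged sketch of what a prompt-based evaluation of a causal query might look like, assuming Hugging Face `transformers` and the `gpt-neo-125M` checkpoint the paper lists among its models; the prompt wording, query, and scoring rule are illustrative assumptions, and the paper's actual prompts may differ. Each answer choice is scored by its summed token log-likelihood under the model.

```python
# Hedged sketch of a prompt-based causal-query evaluation (not the authors'
# exact prompts or scoring): rank answer choices by log-likelihood under a
# pre-trained causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # one of the models listed in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical causal query; real queries come from the COLD benchmark.
premise = "The person mixed the batter."
choices = ["The person baked the cake.", "The person preheated the oven."]
prompt = f"Event: {premise} What happens as an effect? Answer: "

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of token log-probabilities of `choice` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs of each token given its preceding context.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(len(targets)), targets]
    # Count only the tokens belonging to the answer choice (BPE boundary
    # effects at the prompt/choice seam are ignored in this sketch).
    return token_lp[prompt_ids.shape[1] - 1 :].sum().item()

scores = [choice_logprob(prompt, c) for c in choices]
print("Predicted effect:", choices[scores.index(max(scores))])
```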