Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference
Authors: Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar, Saketh Bachu, Vineeth Balasubramanian, Amit Sharma
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS AND RESULTS Datasets. We evaluate the triplet method using benchmark datasets from the BNLearn repository (Scutari & Denis, 2014): Earthquake, Cancer, Survey, Asia, Asia modified (Asia-M), and Child. Across multiple real-world graphs, such a triplet-based method yields a more accurate order than the pairwise prompt, using both LLMs and human annotators. |
| Researcher Affiliation | Collaboration | 1UIUC, 2CISPA Helmholtz Center for Information Security, Germany, 3MIT, 4IIT Hyderabad, India, 5Microsoft Research, India EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Integrating π̂ in constraint-based methods... Algorithm 2 Integrating π̂ in score-based methods |
| Open Source Code | Yes | Code: https://github.com/AniketVashishtha/Causal_Order_Imperfect_Experts |
| Open Datasets | Yes | Datasets. We evaluate the triplet method using benchmark datasets from the BNLearn repository (Scutari & Denis, 2014): Earthquake, Cancer, Survey, Asia, Asia modified (Asia-M), and Child... Neuropathic dataset (Tu et al., 2019)... Alzheimers: This graph (refer Figure A9)... (Abdulaal et al., 2024)... Covid-19: This graph... (Mascaro et al., 2022). |
| Dataset Splits | No | The information is insufficient as the paper mentions using various sample sizes for evaluation (e.g., "across five different sample sizes: 250, 500, 1000, 5000, 10000") but does not provide explicit training/test/validation splits for these datasets. |
| Hardware Specification | No | The information is insufficient. The paper mentions using various LLMs (GPT-3.5-turbo, GPT-4, Phi-3, and Llama3) and various causal discovery algorithms, but it does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments or computations. |
| Software Dependencies | No | The information is insufficient. The paper mentions using the BNLearn repository (Scutari & Denis, 2014) and the DoWhy library (Sharma & Kiciman, 2020), as well as several causal discovery algorithms (PC, SCORE, ICA-LiNGAM, DirectLiNGAM, NOTEARS, CaMML), but it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For every edge, we leverage the votes from the triplet prompts to establish a probability distribution over edge orientations. We use this to compute entropy for each edge, removing those with higher entropy (lower confidence). To minimize Dtop, we prune edges with entropy below the mean of all entropies... We use the causal order π̂ obtained from experts as a level order prior (Defn 3.4) to such methods. We handle any cycles in the expert's output by assigning all nodes in a cycle to the same level... Once the adjustment set is identified, the causal effect is estimated using the DoWhy library (Sharma & Kiciman, 2020) and linear regression as the estimator. |
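The entropy-based edge pruning quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the vote labels (`fwd`, `rev`, `none`), the example edges, and the helper names are all hypothetical, and we assume (per the quote) that edges with entropy above the mean, i.e. low-confidence edges, are the ones removed.

```python
import math
from collections import Counter

def orientation_entropy(votes):
    """Shannon entropy (bits) of the empirical distribution over orientation votes."""
    counts = Counter(votes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def prune_low_confidence_edges(edge_votes):
    """Keep only edges whose orientation entropy is at or below the mean entropy."""
    entropies = {edge: orientation_entropy(v) for edge, v in edge_votes.items()}
    mean_h = sum(entropies.values()) / len(entropies)
    return {edge: v for edge, v in edge_votes.items() if entropies[edge] <= mean_h}

# Hypothetical triplet-prompt votes: 'fwd' (X -> Y), 'rev' (Y -> X), 'none'.
edge_votes = {
    ("smoking", "cancer"): ["fwd", "fwd", "fwd", "fwd", "rev"],   # near-unanimous: low entropy
    ("asia", "tub"):       ["fwd", "rev", "none", "fwd", "rev"],  # split votes: high entropy
}
kept = prune_low_confidence_edges(edge_votes)  # retains only the low-entropy edge
```

The design choice here mirrors the quoted setup: votes from repeated triplet prompts define a distribution per edge, and entropy acts as an inverse confidence score for deciding which expert-suggested edges to trust.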