Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference
Authors: Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar, Saketh Bachu, Vineeth Balasubramanian, Amit Sharma
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS AND RESULTS Datasets. We evaluate the triplet method using benchmark datasets from the BNLearn repository (Scutari & Denis, 2014): Earthquake, Cancer, Survey, Asia, Asia modified (Asia-M), and Child. Across multiple real-world graphs, such a triplet-based method yields a more accurate order than the pairwise prompt, using both LLMs and human annotators. |
| Researcher Affiliation | Collaboration | 1UIUC, 2CISPA Helmholtz Center for Information Security, Germany, 3MIT, 4IIT Hyderabad, India, 5Microsoft Research, India EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Integrating π̂ in constraint-based methods... Algorithm 2 Integrating π̂ in score-based methods |
| Open Source Code | Yes | Code: https://github.com/AniketVashishtha/Causal_Order_Imperfect_Experts |
| Open Datasets | Yes | Datasets. We evaluate the triplet method using benchmark datasets from the BNLearn repository (Scutari & Denis, 2014): Earthquake, Cancer, Survey, Asia, Asia modified (Asia-M), and Child... Neuropathic dataset (Tu et al., 2019)... Alzheimers: This graph (refer Figure A9)... (Abdulaal et al., 2024)... Covid-19: This graph... (Mascaro et al., 2022). |
| Dataset Splits | No | The information is insufficient as the paper mentions using various sample sizes for evaluation (e.g., "across five different sample sizes: 250, 500, 1000, 5000, 10000") but does not provide explicit training/test/validation splits for these datasets. |
| Hardware Specification | No | The information is insufficient. The paper mentions using various LLMs (GPT-3.5-turbo, GPT-4, Phi-3, and Llama3) and various causal discovery algorithms, but it does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments or computations. |
| Software Dependencies | No | The information is insufficient. The paper mentions using the BNLearn repository (Scutari & Denis, 2014) and the DoWhy library (Sharma & Kiciman, 2020), as well as several causal discovery algorithms (PC, SCORE, ICA-LiNGAM, DirectLiNGAM, NOTEARS, CaMML), but it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For every edge, we leverage the votes from the triplet prompts to establish a probability distribution over edge orientations. We use this to compute entropy for each edge, removing those with higher entropy (lower confidence). To minimize Dtop, we prune edges with entropy below the mean of all entropies... We use the causal order π̂ obtained from experts as a level order prior (Defn 3.4) to such methods. We handle any cycles in the expert's output by assigning all nodes in a cycle to the same level... Once the adjustment set is identified, the causal effect is estimated using the DoWhy library (Sharma & Kiciman, 2020) and linear regression as the estimator. |
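The entropy-based edge pruning quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the vote labels (`fwd`, `rev`, `none`), the example edges, and the helper names are all hypothetical, and we assume (per the quote) that edges with entropy above the mean, i.e. low-confidence edges, are the ones removed.

```python
import math
from collections import Counter

def orientation_entropy(votes):
    """Shannon entropy (bits) of the empirical distribution over orientation votes."""
    counts = Counter(votes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def prune_low_confidence_edges(edge_votes):
    """Keep only edges whose orientation entropy is at or below the mean entropy."""
    entropies = {edge: orientation_entropy(v) for edge, v in edge_votes.items()}
    mean_h = sum(entropies.values()) / len(entropies)
    return {edge: v for edge, v in edge_votes.items() if entropies[edge] <= mean_h}

# Hypothetical triplet-prompt votes: 'fwd' (X -> Y), 'rev' (Y -> X), 'none'.
edge_votes = {
    ("smoking", "cancer"): ["fwd", "fwd", "fwd", "fwd", "rev"],   # near-unanimous: low entropy
    ("asia", "tub"):       ["fwd", "rev", "none", "fwd", "rev"],  # split votes: high entropy
}
kept = prune_low_confidence_edges(edge_votes)  # retains only the low-entropy edge
```

The design choice here mirrors the quoted setup: votes from repeated triplet prompts define a distribution per edge, and entropy acts as an inverse confidence score for deciding which expert-suggested edges to trust.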