End-To-End Causal Effect Estimation from Unstructured Natural Language Data
Authors: Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul G. Krishnan, Chris J. Maddison
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prepare six (two semi-synthetic and four real) observational datasets, paired with corresponding ground truth in the form of randomized trials, which we used to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. |
| Researcher Affiliation | Collaboration | Nikita Dhawan University of Toronto, Vector Institute nikita@cs.toronto.edu Leonardo Cotta Vector Institute leonardo.cotta@vectorinstitute.ai Karen Ullrich Meta AI karenu@meta.com Rahul G. Krishnan University of Toronto, Vector Institute rahulgk@cs.toronto.edu Chris J. Maddison University of Toronto, Vector Institute cmaddis@cs.toronto.edu |
| Pseudocode | No | The paper includes diagrams illustrating the pipeline steps (e.g., Figure 2) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | We used two standard, publicly available randomized datasets: Hillstrom [19] and Retail Hero [49], and plan to open-source scripts to generate our data. |
| Open Datasets | Yes | We used two standard, publicly available randomized datasets: Hillstrom [19] and Retail Hero [49]... curated from publicly available Reddit posts from the Push Shift dataset [7]... clinical trials which performed head-to-head comparisons... Semaglutide vs. Tirzepatide [18] and Semaglutide vs. Liraglutide [8]... and Erenumab vs. Topiramate [37] and Onabotulinumtoxin A vs. Topiramate [39]. |
| Dataset Splits | Yes | We used the first of these to validate implementation choices of NATURAL (like filtering, imputations, prompt specifications) and the other three as held-out test settings, see appendix C. |
| Hardware Specification | No | The paper states: 'We used GPT-4 Turbo for sampling and LLAMA2-70B for computing conditional probabilities.' These refer to large language models/APIs but do not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to run these models or the authors' own experiments. |
| Software Dependencies | No | The paper mentions using 'GPT-4 Turbo' and 'LLAMA2-70B' which are specific models, but it does not provide version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, etc.). |
| Experiment Setup | Yes | We implemented the entire pipeline as follows: 1. Initial filter. ... 2. Filter by relevance. ... 3. Filter by treatment-outcome. ... 4. Filter known covariates by inclusion criteria. ... 5. Extract known and unknown covariates. ... 6. Infer conditionals. ... 7. Weight reports according to inclusion criteria match. ... 8. Finally, given all the required extractions and conditional probabilities, we required discrete covariates to plug them into our NATURAL estimators. Hence, we converted any continuous covariates into discrete categories. |
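The filtering stages quoted above (steps 1-4 of the pipeline) can be sketched as a chain of predicates over free-text reports. This is a minimal illustration only: the names and predicates below are hypothetical stand-ins for the paper's LLM-based filters, not its actual implementation.

```python
# Hypothetical sketch of the report-filtering stages (steps 1-4).
# The lambda predicates stand in for LLM calls; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Report:
    text: str
    covariates: dict = field(default_factory=dict)

def run_filters(reports, filters):
    """Apply each filter in order, keeping only reports that pass all of them."""
    for keep in filters:
        reports = [r for r in reports if keep(r)]
    return reports

# Toy stand-ins for the LLM-based predicates described in the pipeline.
is_long_enough  = lambda r: len(r.text.split()) >= 3           # 1. initial filter
mentions_topic  = lambda r: "migraine" in r.text.lower()        # 2. relevance
names_treatment = lambda r: "topiramate" in r.text.lower()      # 3. treatment-outcome
meets_inclusion = lambda r: r.covariates.get("age", 0) >= 18    # 4. inclusion criteria

reports = [
    Report("Topiramate helped my migraine a lot", {"age": 34}),
    Report("Great day", {"age": 40}),
    Report("Topiramate migraine diary", {"age": 15}),
]
kept = run_filters(reports, [is_long_enough, mentions_topic,
                             names_treatment, meets_inclusion])
print(len(kept))  # 1
```

Ordering the cheap checks first (length, keyword presence) before the expensive LLM-backed ones mirrors the paper's motivation for a staged filter, since each stage shrinks the set the next stage must process.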
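Step 8 above converts continuous covariates into discrete categories before plugging conditionals into the estimators. A hedged sketch of that idea, using a generic backdoor-adjustment plug-in with toy numbers in place of the LLM-derived conditionals (this is not the paper's exact NATURAL estimator, and the bin edges and probabilities are invented):

```python
# Sketch: discretize a continuous covariate, then plug the resulting strata
# into a backdoor-adjustment ATE estimate. Illustrative numbers only.
import bisect

def discretize(value, edges):
    """Map a continuous value to a bin index given sorted bin edges."""
    return bisect.bisect_right(edges, value)

# Toy age bins: (-inf, 30], (30, 60], (60, inf)
edges = [30, 60]
assert discretize(25, edges) == 0 and discretize(45, edges) == 1

# Toy stratum marginals p(X=x) and outcome conditionals p(Y=1 | T=t, X=x).
p_x = {0: 0.3, 1: 0.5, 2: 0.2}
p_y_given_t_x = {(1, 0): 0.6, (0, 0): 0.5,
                 (1, 1): 0.7, (0, 1): 0.5,
                 (1, 2): 0.8, (0, 2): 0.6}

# Backdoor adjustment: ATE = sum_x p(x) * [p(Y=1|T=1,x) - p(Y=1|T=0,x)]
ate = sum(p_x[x] * (p_y_given_t_x[(1, x)] - p_y_given_t_x[(0, x)])
          for x in p_x)
print(round(ate, 3))  # 0.17
```

Discretization makes the sums over covariate strata finite, which is why the pipeline converts any continuous covariates to categories before the plug-in step.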