End-To-End Causal Effect Estimation from Unstructured Natural Language Data
Authors: Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul G. Krishnan, Chris J. Maddison
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prepare six (two semi-synthetic and four real) observational datasets, paired with corresponding ground truth in the form of randomized trials, which we used to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. |
| Researcher Affiliation | Collaboration | Nikita Dhawan University of Toronto, Vector Institute nikita@cs.toronto.edu Leonardo Cotta Vector Institute leonardo.cotta@vectorinstitute.ai Karen Ullrich Meta AI karenu@meta.com Rahul G. Krishnan University of Toronto, Vector Institute rahulgk@cs.toronto.edu Chris J. Maddison University of Toronto, Vector Institute cmaddis@cs.toronto.edu |
| Pseudocode | No | The paper includes diagrams illustrating the pipeline steps (e.g., Figure 2) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | We used two standard, publicly available randomized datasets: Hillstrom [19] and Retail Hero [49], and plan to open-source scripts to generate our data. |
| Open Datasets | Yes | We used two standard, publicly available randomized datasets: Hillstrom [19] and Retail Hero [49]... curated from publicly available Reddit posts from the Push Shift dataset [7]... clinical trials which performed head-to-head comparisons... Semaglutide vs. Tirzepatide [18] and Semaglutide vs. Liraglutide [8]... and Erenumab vs. Topiramate [37] and Onabotulinumtoxin A vs. Topiramate [39]. |
| Dataset Splits | Yes | We used the first of these to validate implementation choices of NATURAL (like filtering, imputations, prompt specifications) and the other three as held-out test settings, see appendix C. |
| Hardware Specification | No | The paper states: 'We used GPT-4 Turbo for sampling and LLAMA2-70B for computing conditional probabilities.' These refer to large language models/APIs but do not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to run these models or the authors' own experiments. |
| Software Dependencies | No | The paper mentions using 'GPT-4 Turbo' and 'LLAMA2-70B' which are specific models, but it does not provide version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, etc.). |
| Experiment Setup | Yes | We implemented the entire pipeline as follows: 1. Initial filter. ... 2. Filter by relevance. ... 3. Filter by treatment-outcome. ... 4. Filter known covariates by inclusion criteria. ... 5. Extract known and unknown covariates. ... 6. Infer conditionals. ... 7. Weight reports according to inclusion criteria match. ... 8. Finally, given all the required extractions and conditional probabilities, we required discrete covariates to plug them into our NATURAL estimators. Hence, we converted any continuous covariates into discrete categories. |
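The filtering stages quoted above (steps 1-4 of the pipeline) can be sketched as a chain of predicates over free-text reports. This is a minimal illustration only: the names and predicates below are hypothetical stand-ins for the paper's LLM-based filters, not its actual implementation.

```python
# Hypothetical sketch of the report-filtering stages (steps 1-4).
# The lambda predicates stand in for LLM calls; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Report:
    text: str
    covariates: dict = field(default_factory=dict)

def run_filters(reports, filters):
    """Apply each filter in order, keeping only reports that pass all of them."""
    for keep in filters:
        reports = [r for r in reports if keep(r)]
    return reports

# Toy stand-ins for the LLM-based predicates described in the pipeline.
is_long_enough  = lambda r: len(r.text.split()) >= 3           # 1. initial filter
mentions_topic  = lambda r: "migraine" in r.text.lower()        # 2. relevance
names_treatment = lambda r: "topiramate" in r.text.lower()      # 3. treatment-outcome
meets_inclusion = lambda r: r.covariates.get("age", 0) >= 18    # 4. inclusion criteria

reports = [
    Report("Topiramate helped my migraine a lot", {"age": 34}),
    Report("Great day", {"age": 40}),
    Report("Topiramate migraine diary", {"age": 15}),
]
kept = run_filters(reports, [is_long_enough, mentions_topic,
                             names_treatment, meets_inclusion])
print(len(kept))  # 1
```

Ordering the cheap checks first (length, keyword presence) before the expensive LLM-backed ones mirrors the paper's motivation for a staged filter, since each stage shrinks the set the next stage must process.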
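Step 8 above converts continuous covariates into discrete categories before plugging conditionals into the estimators. A hedged sketch of that idea, using a generic backdoor-adjustment plug-in with toy numbers in place of the LLM-derived conditionals (this is not the paper's exact NATURAL estimator, and the bin edges and probabilities are invented):

```python
# Sketch: discretize a continuous covariate, then plug the resulting strata
# into a backdoor-adjustment ATE estimate. Illustrative numbers only.
import bisect

def discretize(value, edges):
    """Map a continuous value to a bin index given sorted bin edges."""
    return bisect.bisect_right(edges, value)

# Toy age bins: (-inf, 30], (30, 60], (60, inf)
edges = [30, 60]
assert discretize(25, edges) == 0 and discretize(45, edges) == 1

# Toy stratum marginals p(X=x) and outcome conditionals p(Y=1 | T=t, X=x).
p_x = {0: 0.3, 1: 0.5, 2: 0.2}
p_y_given_t_x = {(1, 0): 0.6, (0, 0): 0.5,
                 (1, 1): 0.7, (0, 1): 0.5,
                 (1, 2): 0.8, (0, 2): 0.6}

# Backdoor adjustment: ATE = sum_x p(x) * [p(Y=1|T=1,x) - p(Y=1|T=0,x)]
ate = sum(p_x[x] * (p_y_given_t_x[(1, x)] - p_y_given_t_x[(0, x)])
          for x in p_x)
print(round(ate, 3))  # 0.17
```

Discretization makes the sums over covariate strata finite, which is why the pipeline converts any continuous covariates to categories before the plug-in step.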