Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Invariant Structure Learning for Better Generalization and Causal Explainability

Authors: Yunhao Ge, Sercan O Arik, Jinsung Yoon, Ao Xu, Laurent Itti, Tomas Pfister

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of ISL on various synthetic and real-world datasets. ISL yields state-of-the-art SCM discovery (clearly outperforming alternatives on real-world data) with a particularly prominent improvement for complex graphs structures. In addition, ISL improves the test prediction accuracy throughout, with especially large improvements in cases with significant data drifts (up to 80% MSE reduction compared to alternatives). Section 4 Experiments: In this section, we evaluate the proposed ISL framework for causal explainability and better generalization. We conduct extensive experiments in two settings based on the availability of target labels: supervised learning tasks in Sec. 4.1 and self-supervised learning tasks in Sec. 4.2. Details and more results are provided in the Appendix D. Baselines: On causal explainability, we choose NOTEARS-MLP (Zheng et al., 2020), GOLEM (Ng et al., 2020), and No Fear (Wei et al., 2020) as the baselines for learning the SCM which represented as a DAG. On target prediction, we choose a standard MLP and CASTLE (Kyono et al., 2020) as the baseline methods. Metrics: We evaluate the estimated Y-related DAG and whole DAG structure using Structural Hamming Distance (SHD): the number of missing, falsely detected or reversed edges, lower the better. We evaluate the target (Y) prediction accuracy in Mean Squared Error (MSE). We compute SHD and the errors for multiple times and report the mean value.
Researcher Affiliation | Collaboration | Yunhao Ge, Sercan Ö. Arık, Jinsung Yoon, Ao Xu, Laurent Itti, and Tomas Pfister. EMAIL, EMAIL. Google Cloud AI, Sunnyvale, CA, USA; University of Southern California, Los Angeles, CA, USA
Pseudocode | Yes | Algorithm 1: Supervised Invariant Structure Learning Input: Dataset D Output: DAG, Y predictor f(X) = h θY 1 (X) ... Algorithm 2: Self-Supervised Invariant Structure Learning Input: Dataset D Output: DAG
Open Source Code | Yes | We open-source our code at https://github.com/AaronXu9/ISL.git. Footnote 1: The implementation is available in https://github.com/AaronXu9/ISL.git
Open Datasets | Yes | We perform supervised learning experiments on real-world datasets with GT causal structure: Boston Housing (Binder et al., 1997; bos) and Insurance (Binder et al., 1997; ins) datasets. ... The Sachs dataset is for the discovery of protein signaling network on expression levels of different proteins and phospholipids in human cells (Sachs et al., 2005), and is a popular benchmark for causal graph discovery, containing both observational and interventional data. ... http://lib.stat.cmu.edu/datasets/boston. https://link.springer.com/article/10.1023/A:1007421730016. https://www.science.org/doi/full/10.1126/science.1105809.
Dataset Splits | Yes | We perform supervised learning experiments on real-world datasets with GT causal structure: Boston Housing (Binder et al., 1997; bos) and Insurance (Binder et al., 1997; ins) datasets. For each, we randomly split the train/validation/test with the proportion 0.8/0.1/0.1.
Hardware Specification | Yes | The time measurements were obtained on an Apple M1 Pro chip with 16GB of memory.
Software Dependencies | No | The paper mentions mathematical optimization methods like L-BFGS-B (Zhu et al., 1997) and clustering algorithms like K-means (Lloyd, 1982), but it does not specify any software libraries or frameworks used (e.g., PyTorch, TensorFlow, scikit-learn) along with their version numbers, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | We set a minimum edges number Emin and a maximum edges number Emax based on the dataset information. Usually, Emin is half of the number of nodes |E|/2 and Emax is 5|E|. We also set a range of threshold t [tmin, tmax] and a step size ts base on the value range of W. Usually we use tmin = min(W) and tmax = max(W). ... We choose the value of γ and βi that achieves the smallest target Y reconstruction on the validation set. We find the parameters: γ = 1; β1 = 0.001; β2 = 0.01; β3 = 0.01; β4 = 0.01 as reasonable choices across many different settings, although they are not extensively optimized.
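
Illustrative sketch (not taken from the paper or its repository): two quantities quoted above can be made concrete in a few lines of Python. The first function computes the Structural Hamming Distance as defined in the Research Type response, counting each node pair whose edge is missing, falsely detected, or reversed exactly once. The second sweeps a threshold over a weight matrix W from min(W) to max(W) in fixed steps and keeps the thresholds whose edge count falls within [Emin, Emax]. The quoted text elides how the sweep is actually used, so the keep-if-in-range rule, the "edge exists when W[i][j] > t" convention, and the function names `structural_hamming_distance` and `candidate_thresholds` are all assumptions made for illustration.

```python
def structural_hamming_distance(true_adj, est_adj):
    """Count node pairs whose edge is missing, extra, or reversed.

    Both arguments are square adjacency matrices (lists of lists), where a
    nonzero entry [i][j] means an edge i -> j. A reversed edge counts once.
    """
    n = len(true_adj)
    if len(est_adj) != n or any(len(row) != n for row in list(true_adj) + list(est_adj)):
        raise ValueError("adjacency matrices must be square and of equal size")
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            true_pair = (true_adj[i][j] != 0, true_adj[j][i] != 0)
            est_pair = (est_adj[i][j] != 0, est_adj[j][i] != 0)
            if true_pair != est_pair:
                distance += 1
    return distance


def candidate_thresholds(W, e_min, e_max, t_step):
    """Sweep t from min(W) to max(W) and keep thresholds with an edge count in range.

    ASSUMPTION: an edge i -> j is kept when W[i][j] > t, and a threshold is
    retained when e_min <= number of edges <= e_max. The source does not
    spell out this rule. Returns a list of (threshold, edge_count) tuples.
    """
    if t_step <= 0:
        raise ValueError("t_step must be positive")
    values = [w for row in W for w in row]
    t_min, t_max = min(values), max(values)
    kept = []
    k = 0
    while t_min + k * t_step <= t_max:
        t = t_min + k * t_step
        n_edges = sum(1 for w in values if w > t)
        if e_min <= n_edges <= e_max:
            kept.append((t, n_edges))
        k += 1
    return kept


# Hypothetical 3-node example: the true graph is 0 -> 1 -> 2.
true_graph = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
# The estimate reverses 0 -> 1 and adds a spurious 0 -> 2.
estimated_graph = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]
# Hypothetical non-negative edge strengths.
weights = [[0.0, 0.9, 0.1], [0.0, 0.0, 0.6], [0.2, 0.0, 0.0]]

shd = structural_hamming_distance(true_graph, estimated_graph)
sweep = candidate_thresholds(weights, e_min=1, e_max=2, t_step=0.25)
```

In this toy example `shd` is 2 (one reversed edge, one extra edge), and the sweep keeps the thresholds 0.25, 0.5, and 0.75. Counting a reversed edge once rather than twice is one common SHD convention; implementations differ on this point, so reported SHD values should be compared only under the same convention.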