Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Invariant Structure Learning for Better Generalization and Causal Explainability

Authors: Yunhao Ge, Sercan O Arik, Jinsung Yoon, Ao Xu, Laurent Itti, Tomas Pfister

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the effectiveness of ISL on various synthetic and real-world datasets. ISL yields state-of-the-art SCM discovery (clearly outperforming alternatives on real-world data) with a particularly prominent improvement for complex graphs structures. In addition, ISL improves the test prediction accuracy throughout, with especially large improvements in cases with significant data drifts (up to 80% MSE reduction compared to alternatives). Section 4 Experiments: In this section, we evaluate the proposed ISL framework for causal explainability and better generalization. We conduct extensive experiments in two settings based on the availability of target labels: supervised learning tasks in Sec. 4.1 and self-supervised learning tasks in Sec. 4.2. Details and more results are provided in the Appendix D. Baselines: On causal explainability, we choose NOTEARS-MLP (Zheng et al., 2020), GOLEM (Ng et al., 2020), and No Fear (Wei et al., 2020) as the baselines for learning the SCM which represented as a DAG. On target prediction, we choose a standard MLP and CASTLE (Kyono et al., 2020) as the baseline methods. Metrics: We evaluate the estimated Y-related DAG and whole DAG structure using Structural Hamming Distance (SHD): the number of missing, falsely detected or reversed edges, lower the better. We evaluate the target (Y) prediction accuracy in Mean Squared Error (MSE). We compute SHD and the errors for multiple times and report the mean value.
Researcher Affiliation	Collaboration	Yunhao Ge, Sercan Ö. Arık, Jinsung Yoon, Ao Xu, Laurent Itti, and Tomas Pfister. EMAIL, EMAIL. Google Cloud AI, Sunnyvale, CA, USA; University of Southern California, Los Angeles, CA, USA
Pseudocode	Yes	Algorithm 1: Supervised Invariant Structure Learning Input: Dataset D Output: DAG, Y predictor f(X) = h θY 1 (X) ... Algorithm 2: Self-Supervised Invariant Structure Learning Input: Dataset D Output: DAG
Open Source Code	Yes	We open-source our code at https://github.com/AaronXu9/ISL.git. [Footnote 1] The implementation is available in https://github.com/AaronXu9/ISL.git
Open Datasets	Yes	We perform supervised learning experiments on real-world datasets with GT causal structure: Boston Housing (Binder et al., 1997; bos) and Insurance (Binder et al., 1997; ins) datasets. ... The Sachs dataset is for the discovery of protein signaling network on expression levels of different proteins and phospholipids in human cells (Sachs et al., 2005), and is a popular benchmark for causal graph discovery, containing both observational and interventional data. ... http://lib.stat.cmu.edu/datasets/boston. https://link.springer.com/article/10.1023/A:1007421730016. https://www.science.org/doi/full/10.1126/science.1105809.
Dataset Splits	Yes	We perform supervised learning experiments on real-world datasets with GT causal structure: Boston Housing (Binder et al., 1997; bos) and Insurance (Binder et al., 1997; ins) datasets. For each, we randomly split the train/validation/test with the proportion 0.8/0.1/0.1.
Hardware Specification	Yes	The time measurements were obtained on an Apple M1 Pro chip with 16GB of memory.
Software Dependencies	No	The paper mentions mathematical optimization methods like L-BFGS-B (Zhu et al., 1997) and clustering algorithms like K-means (Lloyd, 1982), but it does not specify any software libraries or frameworks used (e.g., PyTorch, TensorFlow, scikit-learn) along with their version numbers, which are necessary for reproducible software dependencies.
Experiment Setup	Yes	We set a minimum edges number Emin and a maximum edges number Emax based on the dataset information. Usually, Emin is half of the number of nodes |E|/2 and Emax is 5|E|. We also set a range of threshold t [tmin, tmax] and a step size ts base on the value range of W. Usually we use tmin = min(W) and tmax = max(W). ... We choose the value of γ and βi that achieves the smallest target Y reconstruction on the validation set. We find the parameters: γ = 1; β1 = 0.001; β2 = 0.01; β3 = 0.01; β4 = 0.01 as reasonable choices across many different settings, although they are not extensively optimized.
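
Illustrative sketch (not taken from the paper or its repository): the quoted passages define SHD as the number of missing, falsely detected, or reversed edges, and describe a random 0.8/0.1/0.1 train/validation/test split. The Python below shows one way these two pieces could be computed. The function names `structural_hamming_distance` and `split_indices`, the adjacency-matrix encoding (entry [i][j] = 1 means an edge i -> j), and the fixed seed are assumptions made for illustration; the authors' implementation may differ.

```python
import random


def structural_hamming_distance(true_adj, est_adj):
    """Count node pairs whose edge state differs between two directed graphs.

    Each unordered pair (i, j) contributes at most 1, so a reversed edge
    counts once, the same as a missing or a falsely detected edge.
    Assumed encoding: adj[i][j] truthy means there is an edge i -> j.
    """
    n = len(true_adj)
    shd = 0
    for i in range(n):
        for j in range(i + 1, n):
            true_pair = (bool(true_adj[i][j]), bool(true_adj[j][i]))
            est_pair = (bool(est_adj[i][j]), bool(est_adj[j][i]))
            if true_pair != est_pair:
                shd += 1
    return shd


def split_indices(n_samples, proportions=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition sample indices into train/validation/test lists.

    The seed is an assumption added here for repeatability; the quoted
    text only states that the split is random with proportion 0.8/0.1/0.1.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(proportions[0] * n_samples)
    n_val = int(proportions[1] * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Ground truth: 0 -> 1 -> 2. Estimate: 1 -> 0 (reversed), 0 -> 2 (extra),
    # and 1 -> 2 is missing.
    true_graph = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
    est_graph = [[0, 0, 1], [1, 0, 0], [0, 0, 0]]
    print("SHD:", structural_hamming_distance(true_graph, est_graph))

    train, val, test = split_indices(100)
    print("split sizes:", len(train), len(val), len(test))
```

Counting per unordered node pair is one common SHD convention and matches the wording quoted above; some implementations count a reversed edge as two errors, so reported values are only comparable when the same convention is used.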