Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

The third pillar of causal analysis? A measurement perspective on causal representations

Authors: Dingling Yao, Shimeng Huang, Riccardo Cadei, Kun Zhang, Francesco Locatello

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate T-MEX across diverse causal inference scenarios, including numerical simulations and real-world ecological video analysis, demonstrating that the proposed framework and corresponding score effectively assess the identification of learned representations and their usefulness for causal downstream tasks.
Researcher Affiliation	Academia	(1) Institute of Science and Technology Austria; (2) Carnegie Mellon University; (3) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Pseudocode	Yes	Algorithm 1: Compute T-MEX score from one set of samples
Open Source Code	Yes	Reproducible code can be found at https://github.com/shimenghuang/a-measurement-perspective-of-crl. ... Curated code will be published upon acceptance.
Open Datasets	Yes	The dataset we used in 5.2 is publicly available at https://doi.org/10.6084/m9.figshare.26484934.v2.
Dataset Splits	No	For statistical validity, we compute the results using 50 simulated datasets from each model, with each dataset containing 4096 observations. ... ISTAnt consists of video recordings of ant triplets with occasional grooming behavior. ... Retrieving causally valid representations in this case is challenging as we have more non-annotated than annotated data, as described by (Cadei et al., 2024).
Hardware Specification	Yes	We train the CRL models (model A, B, C) using a single node GPU (NVIDIA GeForce RTX1080Ti) with 10GB of RAM, 4 CPU cores for less than one GPU hour. ... We run all the analyses in 5.2 using 48GB of RAM, 20 CPU cores, and a single node GPU (NVIDIA GeForce RTX2080Ti) for 24 GPU hours.
Software Dependencies	No	For both experiments, we estimate T-MEX based on the projected covariance measure (PCM) test (Lundborg et al., 2024) implemented in the python package pycomets (Huang and Kook, 2025)... We run LiNGAM (Shimizu et al., 2006) from causal-learn (Zheng et al., 2024)
Experiment Setup	Yes	Table 2: Hyperparameters for the real-world ecological experiment (5.2 and App. D.2), giving rise to 2,400 model configurations in total. All other settings follow (Cadei et al., 2024, App. C). Hyperparameter: Value(s). Input Preprocessing: YES / NO; Number of Hidden Layers: 1, 2; Batch Size: 64, 128, 256; Adam learning rate: 5e-2, 1e-2, 5e-3, 1e-3, 5e-4; Training objective: Empirical Risk, Invariant Risk (Arjovsky et al., 2020), vREx (Krueger et al., 2021), Deconfounded Risk (Cadei et al., 2025); # Seeds: 0, 1, ..., 9
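
The figure of 2,400 model configurations quoted in the Experiment Setup row is consistent with a full Cartesian product over the listed hyperparameter values (2 × 2 × 3 × 5 × 4 × 10). The following is a minimal sketch that checks this arithmetic. The values are copied from the table excerpt, but the dictionary keys, the objective labels, and the `enumerate_configs` helper are illustrative names, not taken from the authors' code, and a full grid is an assumption about how the configurations were generated.

```python
from itertools import product

# Values copied from the quoted Table 2 excerpt.
# Key names and objective labels are hypothetical, chosen for illustration.
grid = {
    "input_preprocessing": [True, False],
    "num_hidden_layers": [1, 2],
    "batch_size": [64, 128, 256],
    "learning_rate": [5e-2, 1e-2, 5e-3, 1e-3, 5e-4],
    "objective": [
        "empirical_risk",
        "invariant_risk",
        "vrex",
        "deconfounded_risk",
    ],
    "seed": list(range(10)),
}


def enumerate_configs(grid):
    """Return one dict per point of the full Cartesian grid (assumed)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


configs = enumerate_configs(grid)
print(len(configs))
```

If the authors pruned or sampled the grid instead of enumerating it fully, the count would differ, so the match here only shows that a full grid reproduces the stated total.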