Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
The third pillar of causal analysis? A measurement perspective on causal representations
Authors: Dingling Yao, Shimeng Huang, Riccardo Cadei, Kun Zhang, Francesco Locatello
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate T-MEX across diverse causal inference scenarios, including numerical simulations and real-world ecological video analysis, demonstrating that the proposed framework and corresponding score effectively assess the identification of learned representations and their usefulness for causal downstream tasks. |
| Researcher Affiliation | Academia | (1) Institute of Science and Technology Austria; (2) Carnegie Mellon University; (3) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) |
| Pseudocode | Yes | Algorithm 1: Compute T-MEX score from one set of samples |
| Open Source Code | Yes | Reproducible code can be found at https://github.com/shimenghuang/a-measurement-perspective-of-crl. ... Curated code will be published upon acceptance. |
| Open Datasets | Yes | The dataset we used in 5.2 is publicly available at https://doi.org/10.6084/m9.figshare.26484934.v2. |
| Dataset Splits | No | For statistical validity, we compute the results using 50 simulated datasets from each model, with each dataset containing 4096 observations. ... ISTAnt consists of video recordings of ant triplets with occasional grooming behavior. ... Retrieving causally valid representations in this case is challenging as we have more non-annotated than annotated data, as described by (Cadei et al., 2024). |
| Hardware Specification | Yes | We train the CRL models (model A, B, C) using a single node GPU (NVIDIA GeForce RTX1080Ti) with 10GB of RAM, 4 CPU cores for less than one GPU hour. ... We run all the analyses in 5.2 using 48GB of RAM, 20 CPU cores, and a single node GPU (NVIDIA GeForce RTX2080Ti) for 24 GPU hours. |
| Software Dependencies | No | For both experiments, we estimate T-MEX based on the projected covariance measure (PCM) test (Lundborg et al., 2024) implemented in the python package pycomets (Huang and Kook, 2025)... We run LiNGAM (Shimizu et al., 2006) from causal-learn (Zheng et al., 2024) |
| Experiment Setup | Yes | Table 2: Hyperparameters for the real-world ecological experiment (5.2 and App. D.2), giving rise to 2,400 model configurations in total. All other settings follow (Cadei et al., 2024, App. C). Hyperparameter values: Input Preprocessing: YES / NO; Number of Hidden Layers: 1, 2; Batch Size: 64, 128, 256; Adam learning rate: 5e-2, 1e-2, 5e-3, 1e-3, 5e-4; Training objective: Empirical Risk, Invariant Risk (Arjovsky et al., 2020), vREx (Krueger et al., 2021), Deconfounded Risk (Cadei et al., 2025); # Seeds: 0, 1, ..., 9. |
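
The 2,400 configurations quoted in the Experiment Setup row are consistent with a full Cartesian product of the listed hyperparameter values (2 × 2 × 3 × 5 × 4 × 10). The following is a minimal sketch of that arithmetic, assuming the grid is exhaustive; the dictionary keys and objective labels are illustrative names chosen here and are not taken from the authors' code.

```python
from itertools import product

# Hyperparameter values as listed in Table 2 of the paper.
# Key names and objective labels are hypothetical, for illustration only.
grid = {
    "input_preprocessing": [True, False],
    "num_hidden_layers": [1, 2],
    "batch_size": [64, 128, 256],
    "learning_rate": [5e-2, 1e-2, 5e-3, 1e-3, 5e-4],
    "objective": [
        "empirical_risk",
        "invariant_risk",
        "vrex",
        "deconfounded_risk",
    ],
    "seed": list(range(10)),  # seeds 0, 1, ..., 9
}


def enumerate_configs(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


configs = enumerate_configs(grid)
print(len(configs))  # 2400
```

Whether the authors actually trained every combination, or how they ordered the runs, is not stated in the quoted text; the sketch only confirms that the listed values multiply out to the reported total.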