Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

On the Hardness of Conditional Independence Testing In Practice

Authors: Zheng He, Roman Pogodin, Yazhe Li, Namrata Deka, Arthur Gretton, Danica J. Sutherland

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 3 illustrates how the analytic results KCI, σ²_H1, and the corresponding SNR vary with the squared kernel lengthscale ℓ²_C in the synthetic example (7), where the optimal ℓ²_C is obtained by maximizing the SNR. As shown in Figure 3 (right), the theoretical power curve (SNR vs. ℓ²_C) closely tracks the empirical power curve ((1 − Type-II error) vs. ℓ²_C), indicating that the selected ℓ²_C remains effective in practice under the alternative hypothesis. Although maximizing test power is effective under the alternative, it can substantially inflate Type-I error in CI testing. We conducted experiments on the UTKFace dataset (Z. Zhang et al., 2017), following the setup of Y. Zhang et al. (2025).
Researcher Affiliation | Collaboration | Zheng He (UBC); Roman Pogodin (McGill and Mila); Yazhe Li (Microsoft AI); Namrata Deka (CMU); Arthur Gretton (Gatsby Unit, UCL); Danica J. Sutherland (UBC and Amii)
Pseudocode | No | The paper describes methods and procedures in narrative text and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code used in our synthetic data experiments is publicly available at: https://github.com/he-zh/kci-hardness.
Open Datasets | Yes | We conducted experiments on the UTKFace dataset (Z. Zhang et al., 2017), following the setup of Y. Zhang et al. (2025).
Dataset Splits | Yes | The data are divided into an independent training set of size m and a test set of size n. Table 1 summarizes empirical results with 200 training points for regression and 200 test points, averaged over 100 runs, with 500 training epochs. We further compare standard KCI with fixed lengthscales and KCI with power-maximized kernel selection across training sizes 200–1000, keeping the test size fixed at 200. The dataset is split into ten subsets, each with its own training and test partition.
Hardware Specification | No | The amount of computation used was very limited by modern standards. (Question 8 in NeurIPS checklist)
Software Dependencies | No | The paper mentions 'libraries such as sklearn' (Appendix F.3) but does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | The significance level is set at α = 0.05. Table 1 summarizes empirical results with 200 training points for regression and 200 test points, averaged over 100 runs, with 500 training epochs. Kernel ridge regression with leave-one-out validation selects kernels for k C A and k C B, with per-dimension lengthscales. Power maximization is utilized to select k C kernel. Gaussian kernels are used for all kernels throughout.
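
The Experiment Setup entry mentions kernel ridge regression with leave-one-out validation and Gaussian kernels. As a rough illustration only, and not the authors' implementation, the sketch below picks a Gaussian-kernel lengthscale by minimizing the closed-form leave-one-out error of kernel ridge regression. The function names (`gaussian_kernel`, `loo_mse`, `select_lengthscale`), the ridge value, the candidate grid, and the synthetic data are all assumptions made for this example. It also uses a single shared lengthscale rather than the per-dimension lengthscales described in the paper.

```python
import numpy as np


def gaussian_kernel(X, Y, lengthscale):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def loo_mse(K, y, ridge):
    """Leave-one-out mean squared error of kernel ridge regression, in closed form.

    With hat matrix H = (K + ridge * I)^{-1} K, the i-th leave-one-out residual
    equals (y_i - yhat_i) / (1 - H_ii), so no refitting is needed.
    """
    n = K.shape[0]
    y = y.reshape(n, -1)
    H = np.linalg.solve(K + ridge * np.eye(n), K)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))[:, None]
    return float(np.mean(loo_residuals**2))


def select_lengthscale(X, y, candidates, ridge):
    """Return the candidate with the lowest leave-one-out error, plus all errors."""
    errors = {ls: loo_mse(gaussian_kernel(X, X, ls), y, ridge) for ls in candidates}
    return min(errors, key=errors.get), errors


# Hypothetical usage on synthetic data (200 points, matching the training size
# quoted above; the regression target itself is made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
y = np.sin(3.0 * X) + 0.1 * rng.standard_normal((200, 1))

candidates = [0.01, 0.3, 100.0]  # assumed grid, not taken from the paper
best, errors = select_lengthscale(X, y, candidates, ridge=1e-2)
print("selected lengthscale:", best)
```

The closed-form residual avoids refitting the regression once per held-out point, which keeps selection over a grid of lengthscales cheap. Extending the sketch to per-dimension lengthscales would require rescaling each input coordinate separately before computing distances.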