K-Nearest-Neighbor Local Sampling Based Conditional Independence Testing
Authors: Shuai Li, Yingjie Zhang, Hongtu Zhu, Christina Wang, Hai Shu, Ziqi Chen, Zhuoran Sun, Yanfeng Yang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive analyses using both synthetic and real data highlight the computational efficiency of the proposed test. Moreover, it outperforms existing state-of-the-art methods in terms of type I and II errors, even in scenarios with high-dimensional conditioning sets. |
| Researcher Affiliation | Academia | Shuai Li, Yingjie Zhang: School of Statistics, KLATASDS-MOE, East China Normal University; Hongtu Zhu: Departments of Biostatistics, Statistics, Computer Science, and Genetics, The University of North Carolina at Chapel Hill; Christina Dan Wang: Business Division, New York University Shanghai; Hai Shu: Department of Biostatistics, School of Global Public Health, New York University; Ziqi Chen, Zhuoran Sun, Yanfeng Yang: School of Statistics, KLATASDS-MOE, East China Normal University |
| Pseudocode | Yes | Algorithm 1: 1-Nearest-Neighbor sampling (1-NN(V1, V2, n)); Algorithm 2: Classifier-based CMI Estimator; Algorithm 3: K-Nearest-Neighbor local sampling based CI testing |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/LeeShuaikenwitch/NNLSCIT. |
| Open Datasets | Yes | We assess the effectiveness of our method along with six SOTA approaches on two specific datasets: the ABALONE dataset [9] and the Flow-Cytometry dataset [39]. The ABALONE dataset [9] ... is publicly available at the UCI Machine Learning Repository and can be downloaded from https://archive.ics.uci.edu/ml/datasets/abalone. The Flow-Cytometry dataset ... can be obtained from the website https://www.science.org/doi/10.1126/science.1105809. |
| Dataset Splits | Yes | For data sets V1 and V2, we use the 1-NN sampling algorithm (Algorithm 1) to generate a new data set V with n samples. We assign labels l = 1 for all samples in V2 and l = 0 for all samples in V. In this supervised classification task, a binary classifier can be trained using an advanced binary classification model, such as XGBoost [41, 12] or deep neural networks [21]. The classifier produces predicted probability αm = P(l = 1 | Wm) for a given sample Wm, leading to an estimator of the likelihood ratio on Wm given by L̂(Wm) = αm/(1 − αm). ... Divide Vf into training and testing subsets V_f^train and V_f^test, at a ratio of 2:1. Similarly, split Vg into training and testing subsets V_g^train and V_g^test, at a ratio of 2:1. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or cloud computing instances with specifications) used for running the experiments. |
| Software Dependencies | No | The XGBoost classifier was used in all of our experiments. The paper mentions XGBoost but does not specify its version number or other software dependencies with version details. |
| Experiment Setup | Yes | We set the number of repetitions B = 200 and the neighbor order k = 7 for our tests. The XGBoost classifier was used in all of our experiments. ... We set the significance level to α = 0.05 and report the type I error rate and the testing power under H1 for all methods evaluated in our experiments. All the results are presented as an average over 200 independent trials. |
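The Pseudocode row above lists three algorithms. As a rough illustration of the first one, below is a minimal sketch of a 1-NN sampling step with the signature 1-NN(V1, V2, n); the precise roles of V1 and V2 in the paper's Algorithm 1 are not restated in this report, so the function name, arguments, and matching rule are assumptions.

```python
# Hypothetical sketch of a 1-NN sampling step in the spirit of Algorithm 1.
# Assumption: each queried row of V1 is replaced by its single nearest
# neighbor in V2, yielding a resampled data set with n rows.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def one_nn_sample(V1, V2, n, seed=None):
    rng = np.random.default_rng(seed)
    index = NearestNeighbors(n_neighbors=1).fit(V2)           # 1-NN index on V2
    rows = rng.choice(len(V1), size=n, replace=n > len(V1))   # n query rows from V1
    _, nn_idx = index.kneighbors(V1[rows])                    # nearest neighbor in V2
    return V2[nn_idx[:, 0]]                                   # matched neighbors form V
```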
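The ABALONE data cited in the Open Datasets row can be read directly from the UCI repository. The direct file path and column names below are assumptions based on the standard UCI listing; the paper only cites the dataset page.

```python
# Sketch of loading the ABALONE data (assumed direct file path on UCI).
import pandas as pd

ABALONE_URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
               "abalone/abalone.data")
columns = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
           "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
abalone = pd.read_csv(ABALONE_URL, header=None, names=columns)
print(abalone.shape)  # expected: (4177, 9)
```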
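The Dataset Splits row describes the classifier-based likelihood-ratio step. A minimal sketch of that step, assuming V is the 1-NN-sampled set and V2 the original set, is given below; the probability clipping and the fixed random seed are additions not taken from the paper.

```python
# Sketch of the classifier-based likelihood-ratio estimate quoted above:
# label l = 0 for the 1-NN-sampled data V, l = 1 for V2, train XGBoost on a
# 2:1 train/test split, and form L_hat(W_m) = alpha_m / (1 - alpha_m).
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def likelihood_ratio_estimates(V, V2, seed=0):
    W = np.vstack([V, V2])
    l = np.concatenate([np.zeros(len(V)), np.ones(len(V2))])
    W_train, W_test, l_train, _ = train_test_split(
        W, l, test_size=1 / 3, stratify=l, random_state=seed)  # 2:1 split
    clf = XGBClassifier(eval_metric="logloss").fit(W_train, l_train)
    alpha = clf.predict_proba(W_test)[:, 1]        # alpha_m = P(l = 1 | W_m)
    alpha = np.clip(alpha, 1e-6, 1 - 1e-6)         # guard against division by zero
    return alpha / (1.0 - alpha)                   # L_hat(W_m)
```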
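Finally, the Experiment Setup row fixes B = 200, k = 7, and α = 0.05, with results averaged over 200 independent trials. The sketch below shows only the outer evaluation loop under those settings; nnls_cit and generate_data are hypothetical placeholders, not functions from the authors' repository.

```python
# Sketch of the evaluation protocol: run the CI test on 200 independently
# generated data sets and report the rejection rate at alpha = 0.05
# (type I error under H0, power under H1).
def rejection_rate(generate_data, nnls_cit, n_trials=200, alpha=0.05, B=200, k=7):
    rejections = 0
    for trial in range(n_trials):
        X, Y, Z = generate_data(seed=trial)      # one synthetic data set per trial
        p_value = nnls_cit(X, Y, Z, B=B, k=k)    # p-value of the CI test
        rejections += p_value < alpha
    return rejections / n_trials
```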