Finding Statistically Significant Interactions between Continuous Features

Authors: Mahito Sugiyama, Karsten Borgwardt

Venue: IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We examine the effectiveness and the efficiency of C-Tarone using synthetic and real-world datasets.
Researcher Affiliation | Academia | (1) National Institute of Informatics, Tokyo 101-8430, Japan; (2) JST PRESTO, Japan; (3) D-BSSE, ETH Zürich, Basel 4058, Switzerland; (4) SIB Swiss Institute of Bioinformatics, Switzerland
Pseudocode | Yes | Algorithm 1: C-Tarone.
Open Source Code | No | The paper states "All methods were implemented in C/C++ and compiled with gcc 4.8.5", but it does not provide any link to source code, nor does it explicitly state that the code will be made open source or available in supplementary materials.
Open Datasets | Yes | We also evaluate C-Tarone on real-world datasets shown in Table 2 in Appendix, which are benchmark datasets for binary classification from the UCI repository [Lichman, 2013].
Dataset Splits | No | The paper does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or cross-validation schemes) for reproducibility.
Hardware Specification | Yes | We used Amazon Linux AMI release 2017.09 and ran all experiments on a single core of 2.3 GHz Intel Xeon CPU E7-8880 v3 and 2.0 TB of memory.
Software Dependencies | Yes | All methods were implemented in C/C++ and compiled with gcc 4.8.5.
Experiment Setup | Yes | The FWER level α = 0.05 throughout experiments. In each dataset, we generate 20% of features that are associated with the class labels. More precisely, first we generate the entire dataset from the uniform distribution from 0 to 1 and assign the class label 1 to the first N₁ data points. Then, for the N₁ data points in the class 1, we pick up one of the 20% of associated features and copy it to every associated feature with adding Gaussian noise with (µ, σ²) = (0, 0.1). We used the rpart function in R with its default parameter setting, where the Gini index is used for splitting and the minimum number of data points that must exist in a node is 20.
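
The synthetic-data procedure quoted in the Experiment Setup entry can be sketched roughly as follows. This is a minimal Python illustration, not the authors' C/C++ code. The function name `make_synthetic` and its parameters are hypothetical, and several details are assumptions the quoted text leaves open: the associated features are taken to be the first 20% of columns, the first of them is the one that gets copied, the copied source column itself is left without noise, and class 0 is used as the label for the remaining data points.

```python
import math
import random


def make_synthetic(n, d, n1, assoc_frac=0.2, noise_var=0.1, seed=0):
    """Return (X, y, assoc) for n data points with d features.

    X is a list of n rows of d floats, y holds the class labels, and
    assoc lists the indices of the features associated with the labels.
    """
    rng = random.Random(seed)
    # Entire dataset drawn from the uniform distribution on [0, 1).
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    # The first n1 data points get class label 1 (assumption: the rest get 0).
    y = [1 if i < n1 else 0 for i in range(n)]
    # Assumption: the associated features are the first 20% of the columns.
    n_assoc = max(1, round(assoc_frac * d))
    assoc = list(range(n_assoc))
    # Assumption: the first associated feature is the one that is copied.
    source = assoc[0]
    # Noise is specified by its variance: (mu, sigma^2) = (0, 0.1).
    sigma = math.sqrt(noise_var)
    for i in range(n1):
        base = X[i][source]
        for j in assoc:
            if j != source:
                # Copy the source value and add zero-mean Gaussian noise.
                X[i][j] = base + rng.gauss(0.0, sigma)
    return X, y, assoc
```

For example, `make_synthetic(100, 10, 40)` yields 100 data points with 10 features, of which 2 are associated with the class label in the 40 points of class 1. The noise is specified by its variance, so the standard deviation passed to `rng.gauss` is the square root of 0.1.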