Sobolev Independence Criterion
Authors: Youssef Mroueh, Tom Sercu, Mattia Rigotti, Inkit Padhi, Cicero Nogueira dos Santos
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments validating SIC for feature selection in synthetic and real-world experiments. We show that SIC enables reliable and interpretable discoveries, when used in conjunction with the holdout randomization test and knockoffs to control the False Discovery Rate. Code is available at http://github.com/ibm/sic. and (from Section 8, Experiments) Synthetic Data Validation. We first validate our methods and compare them to baseline models in simulation studies on synthetic datasets where the ground truth is available by construction. |
| Researcher Affiliation | Collaboration | Youssef Mroueh, Tom Sercu, Mattia Rigotti, Inkit Padhi, Cicero Dos Santos. IBM Research & MIT-IBM Watson AI Lab. mroueh, mrigotti@us.ibm.com, inkit.padhi@ibm.com |
| Pseudocode | Yes | Algorithm 3 in Appendix B summarizes our stochastic BCD algorithm for training the Neural SIC. The algorithm consists of SGD updates to the network parameters and mirror descent updates to the feature-importance weights. and The principle in HRT [8] that we specify here for SIC in Algorithm 4 (given in Appendix B) is the following: instead of refitting SIC under H0, evaluating the mean of the witness function of SIC on a holdout set from the real distribution gives us p-values. |
| Open Source Code | Yes | Code is available at http://github.com/ibm/sic. |
| Open Datasets | Yes | We experiment with two datasets: A) Complex multivariate synthetic data (Sin Exp)... B) Liang Dataset. We show results on the benchmark dataset proposed by [34]... We consider as a real-world application the Cancer Cell Line Encyclopedia (CCLE) dataset [36]... The second real-world dataset that we analyze is the HIV-1 Drug Resistance [38]... |
| Dataset Splits | Yes | Table 1 shows the heldout MSE of a predictor trained on selected features, averaged over 100 runs (each run: new randomized 90%/10% data split, NN initialization). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments. |
| Software Dependencies | No | The paper mentions software like 'scikit-learn' and 'PyTorch' but does not specify their version numbers, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We train all neural networks used in this work via the Adam optimizer [44] with a learning rate of 1e-4 for 25 epochs. We use PyTorch [45] for all neural network implementations. and We use Boosted SIC, by varying the batch sizes in N ∈ {10, 30, 50}, and computing the geometric mean of the importance scores produced by those three setups as the feature importance needed for Knockoffs. |
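The Boosted SIC aggregation quoted above (a geometric mean of importance scores from fits with different batch sizes) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `boosted_importance` and the sample scores are hypothetical, and only the elementwise-geometric-mean step is taken from the paper's description.

```python
import numpy as np

def boosted_importance(importances):
    """Aggregate per-fit feature-importance vectors (one per batch-size
    setting, e.g. N in {10, 30, 50}) into a single score per feature
    via the elementwise geometric mean."""
    scores = np.asarray(importances, dtype=float)  # shape: (fits, features)
    # Geometric mean computed in log-space for numerical stability;
    # assumes non-negative scores (epsilon guards against log(0)).
    eps = 1e-12
    return np.exp(np.log(scores + eps).mean(axis=0))

# Hypothetical importance scores from three SIC fits:
runs = [
    [0.9, 0.1, 0.4],
    [0.8, 0.2, 0.5],
    [0.7, 0.1, 0.6],
]
print(boosted_importance(runs).round(3))  # one aggregated score per feature
```

The resulting vector would then serve as the feature-importance statistic fed to the knockoff filter for FDR control.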