Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Finding significant combinations of features in the presence of categorical covariates
Authors: Laetitia Papaxanthos, Felipe Llinares-López, Dean Bodenham, Karsten Borgwardt
NeurIPS 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | FACS demonstrates superior speed and statistical power on simulated and real-world datasets compared to the state of the art, opening the door to numerous applications in biomedicine. |
| Researcher Affiliation | Academia | Machine Learning and Computational Biology Lab D-BSSE, ETH Zurich |
| Pseudocode | Yes | Algorithm 1: FACS; Algorithm 2: tarone_cmh |
| Open Source Code | Yes | code for FACS is available on GitHub. Footnote 2: https://github.com/BorgwardtLab/FACS |
| Open Datasets | Yes | A. thaliana GWAS: We apply FACS, LAMP-χ² and Bonf-CMH to two datasets from the plant model organism A. thaliana [1]... The breast cancer data set, as used in [15] |
| Dataset Splits | No | The paper describes generating synthetic datasets and using real-world datasets, but it does not explicitly provide details about training, validation, and test splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | We generated synthetic datasets with one truly associated feature subset S_true and one confounded feature subset S_conf to evaluate precision and ability to correct for confounders... We set ρ_true = ρ_conf = ρ... contain 84 and 95 samples, respectively... Each plant sample is represented by a sequence of approximately 214,000 genetic bases... we downsampled each of the five chromosomes... by a factor of 20, using 20 different offsets... containing between 1,423 and 2,661 features... For both datasets we condition on the ancestry, resulting in k = 5 and k = 3 categories for the covariate... includes 12,773 genes classified into up-regulated or not up-regulated. Each gene is represented by 397 binary features... Two sets of experiments were conducted, conditioning on 8 and 16 categories respectively. |
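
The downsampling step quoted in the Experiment Setup row ("by a factor of 20, using 20 different offsets") can be illustrated with a minimal sketch. This is one plausible reading, not the authors' code: it assumes that downsampling means keeping every 20th feature column starting at each offset 0..19, which produces 20 disjoint feature subsets per chromosome. The function name `downsample_columns` and the toy matrix are hypothetical.

```python
def downsample_columns(matrix, factor=20):
    """Split the columns of a samples-by-features matrix into `factor` subsets.

    Assumption (not confirmed by the source): subset `offset` keeps columns
    offset, offset + factor, offset + 2 * factor, ...
    """
    n_features = len(matrix[0]) if matrix else 0
    subsets = []
    for offset in range(factor):
        cols = range(offset, n_features, factor)
        # Keep every sample (row), but only the strided columns.
        subsets.append([[row[c] for c in cols] for row in matrix])
    return subsets


# Toy binary matrix: 4 samples, 100 features (hypothetical data).
toy = [[(i + j) % 2 for j in range(100)] for i in range(4)]
subsets = downsample_columns(toy, factor=20)
print(len(subsets), len(subsets[0]), len(subsets[0][0]))
```

Under this reading every original feature appears in exactly one subset, so the 20 offsets together cover the full chromosome, while each individual run sees only about one twentieth of the features.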