Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis
Authors: Jiayu Su, David A Knowles, Raúl Rabadán
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate sisPCA's connections with autoencoders and regularized linear regression and showcase its ability to identify and separate hidden data structures through extensive applications, including breast cancer diagnosis from image features, learning aging-associated DNA methylation changes, and single-cell analysis of malaria infection. |
| Researcher Affiliation | Academia | Jiayu Su (1,2,5), David A. Knowles (2,4,5), Raul Rabadan (1,2,3). (1) Program for Mathematical Genomics; (2) Department of Systems Biology, Columbia University; (3) Department of Biomedical Informatics, Columbia University; (4) Department of Computer Science, Columbia University; (5) New York Genome Center |
| Pseudocode | Yes | Algorithm 1: Solving sisPCA-linear using alternating eigendecomposition |
| Open Source Code | Yes | A Python implementation of sisPCA is available on GitHub at https://github.com/JiayuSuPKU/sispca (DOI 10.5281/zenodo.13932660). The repository also includes notebooks to reproduce results in this paper. |
| Open Datasets | Yes | uciml/breast-cancer-wisconsin-data, CC BY-NC-SA 4.0 license. |
| Dataset Splits | No | The paper describes data preprocessing steps, such as using the top 2,000 highly variable genes for scRNA-seq data and downsampling the TCGA dataset, but it does not specify explicit training, validation, and test dataset splits. |
| Hardware Specification | Yes | We ran all provided notebooks (https://github.com/JiayuSuPKU/sispca/tree/main/docs/source/tutorials) using a personal M1 Macbook Air with 16GB RAM and completed most analysis steps in minutes including model training. |
| Software Dependencies | Yes | scvi-tools v1.2.0 (https://scvi-tools.org/) |
| Experiment Setup | Yes | Autoencoders for latent mean and variance: One hidden layer with 128 hidden units, ReLU activation, batch normalization, and dropout. Predictor design (adapted from the scVIGenQCModel in the HCV paper): One hidden layer with 25 neurons, ReLU activation, and dropout. |
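
The Experiment Setup row can be recorded as a small structured configuration, which makes the stated and unstated settings explicit. The following is a minimal sketch and not the authors' code: the class `HiddenBlockSpec`, the input widths (2,000 and 10), the dropout rate of 0.1, and the linear, batch norm, ReLU, dropout ordering are all illustrative assumptions, since the quoted text gives only the hidden widths, the activation, and which blocks use batch normalization.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HiddenBlockSpec:
    """One fully connected hidden block (hypothetical helper, not from the paper)."""

    n_input: int         # assumption: input width is not stated in the row
    n_hidden: int        # stated: 128 (autoencoder) or 25 (predictor)
    batch_norm: bool     # stated: mentioned for the autoencoder only
    dropout_rate: float  # assumption: the rate is not stated in the row

    def operations(self) -> List[str]:
        # Assumed ordering: linear -> batch norm -> ReLU -> dropout.
        ops = [f"Linear({self.n_input}, {self.n_hidden})"]
        if self.batch_norm:
            ops.append(f"BatchNorm1d({self.n_hidden})")
        ops.append("ReLU()")
        ops.append(f"Dropout(p={self.dropout_rate})")
        return ops

    def n_trainable_parameters(self) -> int:
        # Linear layer: one weight per input-hidden pair, plus one bias per unit.
        n_params = self.n_input * self.n_hidden + self.n_hidden
        if self.batch_norm:
            # Batch norm with a learnable scale and shift per hidden unit.
            n_params += 2 * self.n_hidden
        return n_params


# Illustrative instances; input widths and dropout rate are placeholders.
encoder_block = HiddenBlockSpec(
    n_input=2000, n_hidden=128, batch_norm=True, dropout_rate=0.1
)
predictor_block = HiddenBlockSpec(
    n_input=10, n_hidden=25, batch_norm=False, dropout_rate=0.1
)

if __name__ == "__main__":
    for name, block in [("encoder", encoder_block), ("predictor", predictor_block)]:
        print(name, block.operations(), block.n_trainable_parameters())
```

Writing the setup this way shows which values a reproduction would still need to take from the paper or the repository: the dropout rate, the input and latent dimensions, and the output heads for the latent mean and variance are not covered by the quoted text.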