Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PCA++: How Uniformity Induces Robustness to Background Noise in Contrastive Learning

Authors: Mingqi Wu, Qiang Sun, Archer Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, PCA++ outperforms standard PCA and alignment-only PCA+ on simulations, corrupted-MNIST, and single-cell transcriptomics, reliably recovering condition-invariant structure. More broadly, we clarify uniformity's role in contrastive learning, showing that explicit feature dispersion defends against structured noise and enhances robustness.
Researcher Affiliation | Academia | Mingqi Wu McGill University Mila EMAIL Qiang Sun University of Toronto MBZUAI EMAIL Corresponding author Archer Y. Yang McGill University Mila EMAIL Corresponding author
Pseudocode | Yes | For implementation details see Algorithm 1 in Appendix B.2, and for a full derivation of generalized eigenvalue solvers refer to [38, 19].
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code and data are publicly released at the time of submission. The experiments are based on standard datasets and use well-established implementations as described in the supplementary material.
Open Datasets | Yes | We created a synthetic dataset of 5,000 paired images by superimposing MNIST digits [16] (0 or 1) onto distinct ImageNet [15] "grass" patches. ... We evaluated PCA++ on single-cell RNA-seq data from [31], comprising 14,619 control and 14,446 IFN-β stimulated PBMCs across eight immune cell types.
Dataset Splits | Yes | We varied the sample size n ∈ {100, 500, 5000}, keeping d = 0.4 n, and performed 50 independent trials for each setting. Subspace error was measured by the largest principal angle between the estimated and true signal subspaces.
Hardware Specification | Yes | Benchmarked on an Intel Xeon CPU @ 2.20GHz.
Software Dependencies | No | Using an iterative solver like the implicitly restarted Lanczos method (IRLM). This makes the approach scalable to very high-dimensional data, and the details of its computational complexity are discussed in Appendix D. ... using the Seurat v3 method implemented in Scanpy.
Experiment Setup | Yes | Fixing n = 500 and varying the aspect ratio d/n, we embedded a five-dimensional signal subspace (variances [50, 25, 20, 15, 10]) in the first five coordinates and an orthogonal five-dimensional background (variances [500, 400, 300, 200, 100]) in the last five. Applying PCA++ with truncation rank s = 10, we again computed the sine of the largest principal angle to the true signal, averaged over fifty runs.
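
The subspace-error metric quoted in the Dataset Splits and Experiment Setup entries (the sine of the largest principal angle between the estimated and true signal subspaces) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' released code: the helper name `sin_largest_principal_angle` and the toy inputs are hypothetical, and both subspaces are assumed to have the same dimension.

```python
import numpy as np


def sin_largest_principal_angle(U_est, U_true):
    """Sine of the largest principal angle between span(U_est) and span(U_true).

    Hypothetical helper (not from the paper). Both inputs are (d, k) arrays
    with linearly independent columns and the same k.
    """
    # Orthonormalize each basis so the projection below is well defined.
    Q_est, _ = np.linalg.qr(U_est)
    Q_true, _ = np.linalg.qr(U_true)
    # Part of the estimated basis that lies outside the true subspace.
    residual = Q_est - Q_true @ (Q_true.T @ Q_est)
    # For equal-dimensional subspaces, the spectral norm of this residual
    # equals the sine of the largest principal angle.
    return float(np.linalg.norm(residual, 2))


# Toy check (assumed example): rotate a 1-D subspace of R^2 by 30 degrees.
theta = np.deg2rad(30.0)
e1 = np.array([[1.0], [0.0]])
rotated = np.array([[np.cos(theta)], [np.sin(theta)]])
err = sin_largest_principal_angle(rotated, e1)
```

In a setting like the one described above, `U_true` would plausibly hold the first five coordinate axes and `U_est` the estimated five-dimensional basis, with the returned value averaged over repeated trials.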