Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
PCA++: How Uniformity Induces Robustness to Background Noise in Contrastive Learning
Authors: Mingqi Wu, Qiang Sun, Archer Yang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, PCA++ outperforms standard PCA and alignment-only PCA+ on simulations, corrupted-MNIST, and single-cell transcriptomics, reliably recovering condition-invariant structure. More broadly, we clarify uniformity's role in contrastive learning, showing that explicit feature dispersion defends against structured noise and enhances robustness. |
| Researcher Affiliation | Academia | Mingqi Wu McGill University Mila EMAIL Qiang Sun University of Toronto MBZUAI EMAIL Corresponding author Archer Y. Yang McGill University Mila EMAIL Corresponding author |
| Pseudocode | Yes | For implementation details see Algorithm 1 in Appendix B.2, and for a full derivation of generalized eigenvalue solvers refer to [38, 19]. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code and data are publicly released at the time of submission. The experiments are based on standard datasets and use well-established implementations as described in the supplementary material. |
| Open Datasets | Yes | We created a synthetic dataset of 5,000 paired images by superimposing MNIST digits [16] (0 or 1) onto distinct ImageNet [15] "grass" patches. ... We evaluated PCA++ on single-cell RNA-seq data from [31], comprising 14,619 control and 14,446 IFN-β stimulated PBMCs across eight immune cell types. |
| Dataset Splits | Yes | We varied the sample size n ∈ {100, 500, 5000}, keeping d = 0.4n, and performed 50 independent trials for each setting. Subspace error was measured by the largest principal angle between the estimated and true signal subspaces. |
| Hardware Specification | Yes | Benchmarked on an Intel Xeon CPU @ 2.20GHz. |
| Software Dependencies | No | Using an iterative solver like the implicitly restarted Lanczos method (IRLM). This makes the approach scalable to very high-dimensional data, and the details of its computational complexity are discussed in Appendix D. ... using the Seurat v3 method implemented in Scanpy. |
| Experiment Setup | Yes | Fixing n = 500 and varying the aspect ratio d/n, we embedded a five-dimensional signal subspace (variances [50, 25, 20, 15, 10]) in the first five coordinates and an orthogonal five-dimensional background (variances [500, 400, 300, 200, 100]) in the last five. Applying PCA++ with truncation rank s = 10, we again computed the sine of the largest principal angle to the true signal, averaged over fifty runs. |
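
The subspace error quoted in the Dataset Splits and Experiment Setup rows (the sine of the largest principal angle between the estimated and true signal subspaces) can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `sin_largest_principal_angle`, the dimension `d = 200`, and the toy bases are assumptions chosen for illustration. Only the placement of the signal in the first five coordinates and the background in the last five follows the quoted setup.

```python
import numpy as np

def sin_largest_principal_angle(U_est, U_true):
    """Sine of the largest principal angle between span(U_est) and span(U_true).

    Both inputs are (d, k) matrices whose columns span a k-dimensional subspace.
    """
    # Orthonormalize each basis; the singular values of Q_est^T Q_true are then
    # the cosines of the principal angles between the two subspaces.
    Q_est, _ = np.linalg.qr(U_est)
    Q_true, _ = np.linalg.qr(U_true)
    cosines = np.linalg.svd(Q_est.T @ Q_true, compute_uv=False)
    # The smallest cosine corresponds to the largest principal angle.
    cos_min = np.clip(cosines.min(), 0.0, 1.0)
    return float(np.sqrt(1.0 - cos_min**2))

# Illustrative check (hypothetical values): d = 200 ambient dimensions,
# a five-dimensional signal subspace in the first five coordinates.
rng = np.random.default_rng(0)
d, k = 200, 5
U_true = np.eye(d)[:, :k]

# A different basis of the same subspace should give an error near 0.
same = sin_largest_principal_angle(U_true @ rng.standard_normal((k, k)), U_true)

# The background subspace (last five coordinates) is orthogonal to the signal,
# so the error should be 1.
orth = sin_largest_principal_angle(np.eye(d)[:, -k:], U_true)
```

An error near 0 means the estimate recovers the signal subspace, and an error of 1 means at least one estimated direction is orthogonal to it, as would happen if an estimator locked onto the high-variance background instead.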