Capturing the denoising effect of PCA via compression ratio
Authors: Chandra Sekhar Mukherjee, Nikhil Deorkar, Jiapeng Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explain this phenomenon through both theoretical proofs and experiments on real-world data. Building on this new metric, we design a straightforward algorithm that could be used to detect outliers. Roughly speaking, we argue that points that have a lower variance of compression ratio do not share a common signal with others (hence could be considered outliers). We provide theoretical justification for this simple outlier detection algorithm and use simulations to demonstrate that our method is competitive with popular outlier detection tools. Finally, we run experiments on real-world high-dimensional noisy data (single-cell RNA-seq) to show that removing points from these datasets via our outlier detection method improves the accuracy of clustering algorithms. |
| Researcher Affiliation | Academia | Chandra Sekhar Mukherjee chandrasekhar.mukherjee@usc.edu Nikhil Deorkar deorkar@usc.edu Jiapeng Zhang jiapengz@usc.edu Thomas Lord Department of Computer Science, University of Southern California. |
| Pseudocode | Yes | Algorithm 1: Outlier detection via variance of compression ratio. Input: data X, PCA dimension k, number of outliers o. for i = 1 to n do: SC[i] ← VAR_{X,k}(x_i) {VAR_{X,k} defined in Eq. 1} end for. return the o indexes with the lowest SC values. |
| Open Source Code | Yes | We provide the full source code used to generate the results in the supplementary material. |
| Open Datasets | Yes | We focus on single-cell data... using datasets from a popular benchmark database [DRS20] with ground truth community labels. |
| Dataset Splits | No | The paper describes removing 5% or 10% of points and then evaluating clustering performance on the remainder, but it does not specify a distinct 'validation' dataset split for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | All simulations and experiments were run on a 2020 M1 MacBook Pro with 16 GB of memory within 1.5 hours of total running time. |
| Software Dependencies | No | The paper mentions running a "standard implementation of KMeans" but does not specify any software versions or libraries (e.g., Python, scikit-learn, PyTorch). |
| Experiment Setup | Yes | For each of the datasets, we do the following. Let k be the number of communities. We apply some c-dimensional PCA and then run a standard implementation of KMeans with k centers on the post-PCA data and record the NMI and purity score, which are popular clustering accuracy metrics. ... First, we set c = k − 1 (following our theory)... Then, we remove 5% of the points according to the outlier score and then run PCA+KMeans on the rest of the dataset and obtain the new NMI values. Next, we repeat the same experiments by removing 10% of the points. Additionally, we also use c = 2k... |
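The pseudocode and experiment setup quoted in the table can be sketched in Python as follows. This is a minimal sketch, not the authors' released code: the per-point score VAR_{X,k} (the paper's Eq. 1) is not reproduced in the table, so it is taken here as a caller-supplied callable `var_score`, and the scikit-learn PCA/KMeans/NMI pipeline is an assumption, since the paper only mentions "a standard implementation of KMeans" without naming a library.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def detect_outliers(X, k, o, var_score):
    """Algorithm 1 skeleton: return the o indexes with the lowest
    variance-of-compression-ratio score.  `var_score(X, k, i)` stands in
    for VAR_{X,k}(x_i) from the paper's Eq. 1 (not defined here)."""
    scores = np.array([var_score(X, k, i) for i in range(len(X))])
    return np.argsort(scores)[:o]  # o indexes with lowest SC values


def pca_kmeans_nmi(X, labels, k, c=None):
    """Experiment pipeline: c-dimensional PCA (c = k - 1 by default,
    following the paper's theory), KMeans with k centers, NMI score."""
    if c is None:
        c = k - 1
    Z = PCA(n_components=c).fit_transform(X)
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
    return normalized_mutual_info_score(labels, pred)


def evaluate_with_removal(X, labels, k, frac, var_score):
    """Remove a fraction of points (5% or 10% in the paper) by outlier
    score, then rerun PCA + KMeans on the remaining points."""
    o = int(frac * len(X))
    outliers = set(detect_outliers(X, k, o, var_score))
    keep = np.array([i for i in range(len(X)) if i not in outliers])
    return pca_kmeans_nmi(X[keep], np.asarray(labels)[keep], k)
```

With `c = 2k` passed explicitly, `pca_kmeans_nmi` covers the paper's second configuration; purity would be scored analogously alongside NMI.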