Capturing the denoising effect of PCA via compression ratio

Authors: Chandra Sekhar Mukherjee, Nikhil Deorkar, Jiapeng Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explain this phenomenon through both theoretical proofs and experiments on real-world data. Building on this new metric, we design a straightforward algorithm that could be used to detect outliers. Roughly speaking, we argue that points with a lower variance of compression ratio do not share a common signal with the others (and hence could be considered outliers). We provide theoretical justification for this simple outlier detection algorithm and use simulations to demonstrate that our method is competitive with popular outlier detection tools. Finally, we run experiments on real-world high-dimensional noisy data (single-cell RNA-seq) to show that removing points from these datasets via our outlier detection method improves the accuracy of clustering algorithms.
Researcher Affiliation | Academia | Chandra Sekhar Mukherjee (chandrasekhar.mukherjee@usc.edu), Nikhil Deorkar (deorkar@usc.edu), Jiapeng Zhang (jiapengz@usc.edu); Thomas Lord Department of Computer Science, University of Southern California.
Pseudocode | Yes | Algorithm 1: Outlier detection via variance of compression ratio. Input: data X, PCA dimension k, number of outliers o. For i = 1 to n do: SC[i] ← VAR_{X,k}(x_i) {VAR_{X,k} defined in Eq. 1}. End for. Return the o indices with the lowest SC values.
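The excerpt above leaves Eq. 1 implicit, so the following Python sketch is only illustrative. It assumes the compression ratio of a pair (x_i, x_j) is the ratio of their distance before and after projection onto the top-k PCA subspace, and that VAR_{X,k}(x_i) is the variance of these ratios over all j ≠ i; the function names outlier_scores and detect_outliers are hypothetical, and the exact definition lives in the paper's Eq. 1.

```python
import numpy as np
from sklearn.decomposition import PCA

def outlier_scores(X, k):
    """Sketch of SC[i] = VAR_{X,k}(x_i) from Algorithm 1.

    Assumption: the compression ratio of a pair (x_i, x_j) is
    ||x_i - x_j|| / ||P_k(x_i) - P_k(x_j)||, where P_k projects onto
    the top-k PCA subspace; the paper's Eq. 1 gives the exact form.
    """
    Z = PCA(n_components=k).fit_transform(X)  # post-PCA coordinates
    n = X.shape[0]
    scores = np.empty(n)
    for i in range(n):
        pre = np.linalg.norm(X - X[i], axis=1)   # distances before PCA
        post = np.linalg.norm(Z - Z[i], axis=1)  # distances after PCA
        mask = np.arange(n) != i                 # exclude the point itself
        ratios = pre[mask] / np.maximum(post[mask], 1e-12)
        scores[i] = ratios.var()                 # variance of compression ratio
    return scores

def detect_outliers(X, k, o):
    """Return the o indices with the lowest SC values, as in Algorithm 1."""
    return np.argsort(outlier_scores(X, k))[:o]
```

Points whose pairwise distances compress uniformly (low variance of compression ratio) are flagged as outliers, matching the paper's intuition that such points share no common signal with the rest.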
Open Source Code | Yes | We provide the full source code used to generate the results in the supplementary material.
Open Datasets | Yes | We focus on single-cell data... using datasets from a popular benchmark database [DRS20] with ground truth community labels.
Dataset Splits | No | The paper describes removing 5% or 10% of points and then evaluating clustering performance on the remainder, but it does not specify a distinct validation split for hyperparameter tuning or model selection.
Hardware Specification | Yes | All simulations and experiments were run on a 2020 M1 MacBook Pro with 16 GB of memory, within 1.5 hours of total running time.
Software Dependencies | No | The paper mentions running a "standard implementation of KMeans" but does not specify any software versions or libraries (e.g., Python, scikit-learn, PyTorch).
Experiment Setup | Yes | For each of the datasets, we do the following. Let k be the number of communities. We apply c-dimensional PCA, then run a standard implementation of KMeans with k centers on the post-PCA data and record the NMI and purity score, which are popular clustering accuracy metrics. ... First, we set c = k − 1 (following our theory)... Then we remove 5% of the points according to the outlier score, run PCA+KMeans on the rest of the dataset, and obtain the new NMI values. Next, we repeat the same experiment, removing 10% of the points. Additionally, we also use c = 2k...
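To make one round of the described setup concrete, here is a minimal sketch assuming a scikit-learn implementation. It reuses the hypothetical detect_outliers from the sketch above; the purity helper is my own construction from the contingency matrix, and using the same PCA dimension c for outlier detection as for clustering is an assumption not fixed by the excerpt.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    # Fraction of points whose cluster's majority true label matches their own.
    cm = contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()

def evaluate(X, labels, k, c, removal_frac=0.0):
    """One round: optionally drop outliers, then c-dim PCA + KMeans(k)."""
    labels = np.asarray(labels)
    if removal_frac > 0:
        o = int(removal_frac * len(X))
        drop = detect_outliers(X, c, o)  # hypothetical sketch from above
        keep = np.setdiff1d(np.arange(len(X)), drop)
        X, labels = X[keep], labels[keep]
    Z = PCA(n_components=c).fit_transform(X)  # c-dimensional PCA
    pred = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    return normalized_mutual_info_score(labels, pred), purity(labels, pred)

# k communities; c = k - 1 per the paper's theory (c = 2k is also used):
# nmi_full, pur_full = evaluate(X, y, k, k - 1)
# nmi_5pct, pur_5pct = evaluate(X, y, k, k - 1, removal_frac=0.05)
# nmi_10pct, pur_10pct = evaluate(X, y, k, k - 1, removal_frac=0.10)
```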