Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Path-Based Spectral Clustering: Guarantees, Robustness to Outliers, and Fast Algorithms

Authors: Anna Little, Mauro Maggioni, James M. Murphy

JMLR 2020

Reproducibility Variable Result LLM Response
Research Type Experimental The proposed method is demonstrated on a variety of synthetic and real data sets, with performance consistent with our theoretical results. Article outline. ... Numerical experiments on representative data sets appear in Section 7.
Researcher Affiliation Academia Anna Little, Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA; Mauro Maggioni, Department of Applied Mathematics and Statistics, Department of Mathematics, Johns Hopkins University, Baltimore, MD 21218, USA; James M. Murphy, Department of Mathematics, Tufts University, Medford, MA 02139, USA
Pseudocode Yes Algorithm 1 Spectral clustering with metric $\rho$. Input: $\{x_i\}_{i=1}^n$ (data), $\sigma > 0$ (scaling parameter), $f_\sigma$ (kernel function). Output: $Y$ (Labels). 1: Compute the weight matrix $W \in \mathbb{R}^{n \times n}$ with $W_{ij} = f_\sigma(\rho(x_i, x_j))$.
Open Source Code Yes Matlab code implementing both the fast LLPD nearest neighbor searches and LLPD spectral clustering is publicly available at https://bitbucket.org/annavlittle/llpd_code/branch/v2.1.
Open Datasets Yes Skins: This large data set consists of RGB values corresponding to pixels sampled from two classes: human skin and other (footnote 2). The human skin samples are widely sampled with respect to age, gender, and skin color; see Bhatt et al. (2009) for details on the construction of the data set. This data set consists of 245057 data points in $D = 3$ dimensions, corresponding to the RGB values. Note LLPD was approximated from scales $\{t_s\}_{s=1}^m$ defined by 10 percentiles, as opposed to the default exponential scaling. See Figure 11a. Footnote 2: https://archive.ics.uci.edu/ml/datasets/skin+segmentation. DrivFace: The DrivFace data set is publicly available (footnote 3) from the UCI Machine Learning Repository (Lichman, 2013). This data set consists of 606 80 × 80 pixel images of the faces of four drivers, 2 male and 2 female. See Figure 11b. Footnote 3: https://archive.ics.uci.edu/ml/datasets/DrivFace. COIL: The COIL (Columbia University Image Library) data set (footnote 4) consists of images of 20 different objects captured at varying angles (Nene et al., 1996). There are 1440 different data points, each of which is a 32 × 32 image, thought of as a $D = 1024$ dimensional point cloud. See Figure 11c. Footnote 4: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php. Pen Digits: This data set (footnote 5) consists of 3779 spatially resampled digital signals of hand-written digits in 16 dimensions (Alimoglu and Alpaydin, 1996). We consider a subset consisting of five digits: {0, 2, 3, 4, 6}. Footnote 5: https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits. Landsat: The landsat satellite data we consider consists of pixels in 3 × 3 neighborhoods in a multispectral camera with four spectral bands (footnote 6). This leads to a total ambient dimension of $D = 36$. The data considered consists of $K = 4$ classes, consisting of pixels of different physical materials: red soil, cotton, damp soil, and soil with vegetable stubble. Footnote 6: https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
Dataset Splits No No explicit mention of training/test/validation dataset splits or cross-validation setup is provided. The paper mentions overall sample sizes and numbers of noise points, but not how these were partitioned for experimental evaluation.
Hardware Specification No The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models, memory, or cloud infrastructure specifications.
Software Dependencies No The paper mentions that 'Matlab code implementing both the fast LLPD nearest neighbor searches and LLPD spectral clustering is publicly available', but it does not specify the version of Matlab used or any other software dependencies with their respective version numbers.
Experiment Setup Yes Parameters were set consistently across all examples, unless otherwise noted. The initial E-nearest neighbor graph was constructed using $k_{\mathrm{Euc}} = 20$. The scales $\{t_s\}_{s=1}^m$ for approximation were chosen to increase exponentially while requiring $m = 20$. Nearest neighbor denoising was performed using $k_{\mathrm{nse}} = 20$. The denoising threshold $\theta$ was chosen by estimating the elbow in a graph of sorted nearest neighbor distances. For each data set, $L_{\mathrm{SYM}}$ was computed for 20 $\sigma$ values equally spaced in an interval.
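
As a rough illustration of the first step of Algorithm 1 and the sweep over 20 σ values described in the Experiment Setup row, the following Python sketch builds the weight matrix and a symmetric normalized Laplacian for each σ. It is not the authors' Matlab implementation. Several choices are assumptions made only for illustration: the Euclidean distance stands in for the metric ρ (the paper uses LLPD), the kernel is taken to be Gaussian, the Laplacian uses the standard definition I - D^{-1/2} W D^{-1/2}, and the σ interval [0.1, 1.0] and all function names are hypothetical.

```python
import numpy as np


def pairwise_euclidean(X):
    """Euclidean distance matrix. ASSUMPTION: a stand-in for the metric rho,
    which in the paper is LLPD rather than Euclidean distance."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off


def weight_matrix(dist, sigma):
    """W_ij = f_sigma(rho(x_i, x_j)). ASSUMPTION: Gaussian kernel."""
    return np.exp(-((dist / sigma) ** 2))


def sym_laplacian(W):
    """Standard symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}."""
    deg = W.sum(axis=1)  # strictly positive because W_ii = 1
    inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]


def laplacian_spectra(X, sigmas):
    """Sorted eigenvalues of the Laplacian for each sigma in the sweep."""
    dist = pairwise_euclidean(X)
    return [np.linalg.eigvalsh(sym_laplacian(weight_matrix(dist, s)))
            for s in sigmas]


# Two well-separated synthetic blobs (hypothetical toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)),
               rng.normal(3.0, 0.1, (30, 2))])

# 20 equally spaced sigma values; the interval itself is hypothetical.
sigmas = np.linspace(0.1, 1.0, 20)
spectra = laplacian_spectra(X, sigmas)
```

With two well-separated blobs, the two smallest eigenvalues at small σ should both be numerically zero, which is the usual spectral signature of two clusters. Swapping `pairwise_euclidean` for an LLPD computation would bring the sketch closer to the method the paper describes.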