Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Path-Based Spectral Clustering: Guarantees, Robustness to Outliers, and Fast Algorithms

Authors: Anna Little, Mauro Maggioni, James M. Murphy

JMLR 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The proposed method is demonstrated on a variety of synthetic and real data sets, with performance consistent with our theoretical results. Article outline. ... Numerical experiments on representative data sets appear in Section 7.
Researcher Affiliation	Academia	Anna Little EMAIL Department of Computational Mathematics, Science, and Engineering Michigan State University, East Lansing, MI 48824, USA Mauro Maggioni EMAIL Department of Applied Mathematics and Statistics, Department of Mathematics Johns Hopkins University, Baltimore, MD 21218, USA James M. Murphy EMAIL Department of Mathematics Tufts University, Medford, MA 02139, USA
Pseudocode	Yes	Algorithm 1 Spectral clustering with metric $\rho$. Input: $\{x_i\}_{i=1}^n$ (data), $\sigma > 0$ (scaling parameter), $f_\sigma$ (kernel function). Output: $Y$ (labels). 1: Compute the weight matrix $W \in \mathbb{R}^{n \times n}$ with $W_{ij} = f_\sigma(\rho(x_i, x_j))$.
Open Source Code	Yes	Matlab code implementing both the fast LLPD nearest neighbor searches and LLPD spectral clustering is publicly available at https://bitbucket.org/annavlittle/llpd_code/branch/v2.1.
Open Datasets	Yes	Skins: This large data set consists of RGB values corresponding to pixels sampled from two classes: human skin and other (https://archive.ics.uci.edu/ml/datasets/skin+segmentation). The human skin samples are widely sampled with respect to age, gender, and skin color; see Bhatt et al. (2009) for details on the construction of the data set. This data set consists of 245057 data points in D = 3 dimensions, corresponding to the RGB values. Note LLPD was approximated from scales $\{t_s\}_{s=1}^m$ defined by 10 percentiles, as opposed to the default exponential scaling. See Figure 11a. DrivFace: The DrivFace data set is publicly available (https://archive.ics.uci.edu/ml/datasets/DrivFace) from the UCI Machine Learning Repository (Lichman, 2013). This data set consists of 606 80 × 80 pixel images of the faces of four drivers, 2 male and 2 female. See Figure 11b. COIL: The COIL (Columbia University Image Library) data set (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php) consists of images of 20 different objects captured at varying angles (Nene et al., 1996). There are 1440 different data points, each of which is a 32 × 32 image, thought of as a D = 1024 dimensional point cloud. See Figure 11c. Pen Digits: This data set (https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits) consists of 3779 spatially resampled digital signals of hand-written digits in 16 dimensions (Alimoglu and Alpaydin, 1996). We consider a subset consisting of five digits: {0, 2, 3, 4, 6}. Landsat: The landsat satellite data we consider consists of pixels in 3 × 3 neighborhoods in a multispectral camera with four spectral bands (https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)). This leads to a total ambient dimension of D = 36. The data considered consists of K = 4 classes, consisting of pixels of different physical materials: red soil, cotton, damp soil, and soil with vegetable stubble.
Dataset Splits	No	No explicit mention of training/test/validation dataset splits or cross-validation setup is provided. The paper mentions overall sample sizes and numbers of noise points, but not how these were partitioned for experimental evaluation.
Hardware Specification	No	The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models, memory, or cloud infrastructure specifications.
Software Dependencies	No	The paper mentions that 'Matlab code implementing both the fast LLPD nearest neighbor searches and LLPD spectral clustering is publicly available', but it does not specify the version of Matlab used or any other software dependencies with their respective version numbers.
Experiment Setup	Yes	Parameters were set consistently across all examples, unless otherwise noted. The initial E-nearest neighbor graph was constructed using $k_{\text{Euc}} = 20$. The scales $\{t_s\}_{s=1}^m$ for approximation were chosen to increase exponentially while requiring $m = 20$. Nearest neighbor denoising was performed using $k_{\text{nse}} = 20$. The denoising threshold $\theta$ was chosen by estimating the elbow in a graph of sorted nearest neighbor distances. For each data set, $L_{\text{SYM}}$ was computed for 20 $\sigma$ values equally spaced in an interval.
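
The Pseudocode and Experiment Setup rows above quote a weight-matrix step and two parameter grids. The following is a minimal Python sketch of those three pieces, not the authors' implementation (their released code is Matlab). Several things here are assumptions made only for illustration: Euclidean distance stands in for the metric ρ (the paper uses LLPD), the kernel is taken to be Gaussian, and the function names and interval endpoints are hypothetical.

```python
import math


def weight_matrix(points, sigma, rho=None, kernel=None):
    """Step 1 of the quoted Algorithm 1: W_ij = f_sigma(rho(x_i, x_j))."""
    # Assumption: Euclidean distance stands in for the metric rho.
    if rho is None:
        rho = math.dist
    # Assumption: Gaussian kernel f_sigma(d) = exp(-d^2 / sigma^2).
    if kernel is None:
        def kernel(d, s):
            return math.exp(-(d * d) / (s * s))
    n = len(points)
    return [[kernel(rho(points[i], points[j]), sigma) for j in range(n)]
            for i in range(n)]


def sigma_grid(lo, hi, count=20):
    """Return `count` sigma values equally spaced in [lo, hi]."""
    step = (hi - lo) / (count - 1)
    return [lo + k * step for k in range(count)]


def exponential_scales(t_min, t_max, m=20):
    """Return m scales that grow geometrically from t_min to t_max."""
    ratio = (t_max / t_min) ** (1.0 / (m - 1))
    return [t_min * ratio ** s for s in range(m)]


# Hypothetical toy data and endpoints, chosen only to exercise the functions.
toy_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
W = weight_matrix(toy_points, sigma=5.0)
sigmas = sigma_grid(0.1, 2.0)
scales = exponential_scales(0.01, 10.0)
```

Passing a different callable as `rho` is how a path-based metric would be substituted for the Euclidean stand-in; the sketch leaves that metric unimplemented.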