Coresets for Archetypal Analysis

Authors: Sebastian Mair, Ulf Brefeld

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that the coresets lead to improved performance on several data sets. Finally, (v) we provide empirical results on various data sets to support the theoretical derivation. Section 4 (Experiments): We now evaluate the coreset construction for archetypal analysis (abs-cs) and compare it to the performance of archetypal analysis on the full data set, a uniform sample (uniform), the lightweight coresets for k-means (lw-cs, Bachem et al. (2018)), a state-of-the-art coreset construction for k-means (lucic-cs, Lucic et al. (2016)), as well as an approximate solution that learns archetypes on a precomputed subset (frame, Mair et al. (2017)).
Researcher Affiliation | Academia | Sebastian Mair, Leuphana University, Germany, mair@leuphana.de; Ulf Brefeld, Leuphana University, Germany, brefeld@leuphana.de
Pseudocode | Yes | Algorithm 1: Lightweight coreset construction for k-means (Bachem et al., 2018). Algorithm 2: Coreset construction for Archetypal Analysis.
Open Source Code | Yes | https://github.com/smair/archetypalanalysis-coreset
Open Datasets | Yes | We evaluate the algorithms on several data sets: Ijcnn1 refers to data from the IJCNN 2001 neural network competition and has n = 49,990 instances in d = 22 dimensions (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We adopt the preprocessing from Chang and Lin (2001). Pose is a subset of the Human3.6M data set (Catalin Ionescu, 2011; Ionescu et al., 2014), deals with 3D human pose estimation, and is part of the ECCV 2018 Pose Track Challenge (http://vision.imar.ro/human3.6m/challenge_open.php). Song is a subset of the Million Song Dataset (Bertin-Mahieux et al., 2011), which has n = 515,345 data points in d = 90 dimensions, where the task is to predict the year of a song. Covertype (Blackard and Dean, 1999) contains n = 581,012 examples in d = 54 dimensions.
Dataset Splits | No | The paper does not provide train/validation/test splits (e.g., percentages, sample counts, or a splitting methodology). It compares performance on coresets to the 'full data set' but does not describe how the data was partitioned for training, validation, or testing in a reproducible manner.
Hardware Specification | Yes | The code is written in Python and all experiments run on an Intel Xeon machine with 28×2.60GHz and 256GB memory.
Software Dependencies | No | The paper states that 'The code is written in Python' but does not specify the Python version or any other libraries or dependencies with version numbers.
Experiment Setup | Yes | We initialize the archetypes z_1, ..., z_k using the furthest sum procedure (Mørup and Hansen, 2012). The termination criterion is reached when the relative error between iterations is less than 10^-3.
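
For orientation, the lightweight coreset construction for k-means listed under Pseudocode (Algorithm 1, Bachem et al., 2018) can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the authors' code: the function names lightweight_coreset and has_converged are hypothetical, points are assumed to be plain lists of floats, and the convergence check assumes that "relative error between iterations" means the relative change of the objective value. The sampling distribution of Algorithm 2 is not stated in this summary and is not reproduced here.

```python
import random


def lightweight_coreset(points, m, seed=None):
    """Sample m weighted points following the lightweight coreset scheme.

    Each point x is drawn with probability
        q(x) = 1/2 * 1/n + 1/2 * d(x, mean)^2 / sum_x' d(x', mean)^2
    and receives the weight 1 / (m * q(x)).
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # Mean of the data set, computed per coordinate.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]

    # Squared Euclidean distance of every point to the mean.
    sq_dists = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dists)

    if total == 0.0:
        # All points coincide with the mean: fall back to uniform sampling.
        q = [1.0 / n] * n
    else:
        q = [0.5 / n + 0.5 * d / total for d in sq_dists]

    # Draw m indices i.i.d. (with replacement) according to q.
    idx = rng.choices(range(n), weights=q, k=m)
    weights = [1.0 / (m * q[i]) for i in idx]
    return [points[i] for i in idx], weights


def has_converged(prev_error, curr_error, tol=1e-3):
    """Assumed reading of the stopping rule: relative change below tol."""
    return abs(prev_error - curr_error) / max(abs(prev_error), 1e-12) < tol


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
    sample, weights = lightweight_coreset(data, m=3, seed=0)
    print(sample, weights)
```

Points far from the mean are sampled more often but receive smaller weights, so the weighted sample stays an unbiased stand-in for the full data set. The uniform half of q keeps every weight bounded by 2n/m.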