Linear Relaxations for Finding Diverse Elements in Metric Spaces

Authors: Aditya Bhaskara, Mehrdad Ghadiri, Vahab Mirrokni, Ola Svensson

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we study the empirical performance of our algorithm on several standard datasets. Our goal here is two-fold: first, we make an experimental case for the sum-min objective, by comparing the quality of the solutions output by our algorithm (which aims to maximize sum-min) with other popular algorithms (that maximize sum-sum). This is measured by how well the solution covers various clusters in the data, as well as by measuring quality in a feature selection task. Second, we study the approximation quality of the algorithm on real datasets, and observe that it performs much better than the theoretical guarantee (factor 1/8).
Researcher Affiliation | Collaboration | Aditya Bhaskara, University of Utah (bhaskara@cs.utah.edu); Mehrdad Ghadiri, Sharif University of Technology (ghadiri@ce.sharif.edu); Vahab Mirrokni, Google Research (mirrokni@google.com); Ola Svensson, EPFL (ola.svensson@epfl.ch)
Pseudocode | Yes |
procedure round(x)  // input: LP solution x
  1: Initialize S = ∅.
  2: Add (i, r) to S with probability (1 − δ)(1 − e^{−x_ir}) (independently of the other point-radius pairs).
  3: If there exists (j, r′) ∈ S with (j, r′) ≠ (i, r), r ≤ r′, and i ∈ B(j, r′), remove (i, r) from S.
  4: If |S| > k, abort (i.e., return ∅, which has value 0); else return S̄, the set of first coordinates of S.
procedure generalized-round(y, x)  // input: LP solution (y, x)
  1: Initialize S = ∅.
  2: Apply randomized swap rounding to the vector y/2 to obtain Y ∈ {0, 1}^V ∩ P(M).
  3: For each i with Y_i = 1, add i to S and sample a radius r_i according to the probability distribution that selects r ∈ R_i with probability x_ir/y_i.
  4: If i ∈ B(j, r_j) for some j ∈ S with i ≠ j and r_j ≥ r_i, remove i from S.
  5: Return S.
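As a rough illustration, the first rounding procedure can be sketched in Python. This is a minimal sketch, not the authors' implementation: the representation of the LP values as a `(point, radius) → x_ir` dict, the Euclidean ball test, and all names here are assumptions.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two coordinate tuples."""
    return math.dist(p, q)

def round_lp(x, points, k, delta=0.1, rng=random):
    """Hypothetical sketch of the round(x) procedure.

    x: dict {(i, r): x_ir} of LP values over point-radius pairs.
    points: dict {i: coordinates}.
    Returns a set of at most k point indices, or the empty set if
    the rounding "aborts" (solution of value 0).
    """
    # Step 2: keep each pair independently with prob. (1 - delta)(1 - e^{-x_ir}).
    S = [(i, r) for (i, r), xir in x.items()
         if rng.random() < (1 - delta) * (1 - math.exp(-xir))]
    # Step 3: drop (i, r) if some other surviving pair (j, r') with
    # r <= r' has i inside the ball B(j, r').
    kept = [(i, r) for (i, r) in S
            if not any((j, rp) != (i, r) and r <= rp
                       and euclidean(points[i], points[j]) <= rp
                       for (j, rp) in S)]
    # Step 4: abort if more than k pairs survive; else return the points.
    if len(kept) > k:
        return set()
    return {i for (i, r) in kept}
```

With large x_ir values and well-separated points, both candidates survive steps 2 and 3 with high probability and the procedure returns them directly.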
Open Source Code | No | The paper does not provide any explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | We used two datasets with ground-truth clusterings. The first is COIL100, which contains images of 100 different objects [17]. [...] The second dataset is CDK2, a drug-discovery dataset publicly available on BindingDB.org [15, 1]. [...] We used two handwritten text datasets. Multiple Features is a dataset of handwritten digits (649 features, 2000 instances [14]). USPS is a dataset of handwritten text (256 features, 9298 instances [12, 5]).
Dataset Splits | Yes | Then, we restrict data to just the chosen features, and see how well 3-NN clustering in the obtained space (which is much faster to perform than in the original space, due to the reduced number of features) compares with ground-truth clustering.
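The evaluation protocol quoted above (3-NN on the reduced feature space, scored against ground-truth labels under cross-validation) can be sketched generically as follows; this is an illustrative reimplementation under assumed conventions (Euclidean distance, interleaved folds), not the authors' code.

```python
import math
from collections import Counter

def knn_predict(train, labels, q, k=3):
    """Classify q by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], q))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def kfold_accuracy(X, y, k=3, folds=10):
    """Mean accuracy of k-NN under simple interleaved k-fold CV."""
    n, correct = len(X), 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))  # every folds-th point held out
        train = [X[i] for i in range(n) if i not in test_idx]
        labels = [y[i] for i in range(n) if i not in test_idx]
        correct += sum(knn_predict(train, labels, X[i], k) == y[i]
                       for i in test_idx)
    return correct / n
```

On two well-separated clusters this protocol scores perfectly; on real data the accuracy measures how representative the selected feature subset is.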
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., libraries, frameworks, solvers).
Experiment Setup | Yes | The goal of our experiments is to evaluate the sum-min objective as well as the approximation quality of our algorithm on real datasets. For the first of the two, we consider the k-element subsets obtained by maximizing the sum-min objective (using a slight variant of our algorithm), and compare their quality (in terms of being representative of the data) with subsets obtained by maximizing the sum-sum objective, which is the most commonly used diversity objective. [...] We used Euclidean distance as the metric. [...] Tanimoto distance, which is widely used in the drug discovery literature (and is similar to Jaccard distance), was used as the metric. [...] 3-NN clustering with 10-fold cross validation.
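For reference, the Tanimoto distance mentioned in the setup coincides with Jaccard distance when molecular fingerprints are treated as sets of "on" bits; a minimal sketch under that assumption:

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two binary fingerprints, given as
    sets of 'on' bit indices (equal to Jaccard distance on sets)."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention: two empty fingerprints are identical
    return 1.0 - len(a & b) / union
```

For example, fingerprints sharing 2 of 4 total bits are at distance 0.5.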