Linear Relaxations for Finding Diverse Elements in Metric Spaces
Authors: Aditya Bhaskara, Mehrdad Ghadiri, Vahab Mirrokni, Ola Svensson
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we study the empirical performance of our algorithm on several standard datasets. Our goal here is two-fold: first, we make an experimental case for the sum-min objective, by comparing the quality of the solutions output by our algorithm (which aims to maximize sum-min) with other popular algorithms (that maximize sum-sum). This is measured by how well the solution covers various clusters in the data, as well as by measuring quality in a feature selection task. Second, we study the approximation quality of the algorithm on real datasets, and observe that it performs much better than the theoretical guarantee (factor 1/8). |
| Researcher Affiliation | Collaboration | Aditya Bhaskara, University of Utah, bhaskara@cs.utah.edu; Mehrdad Ghadiri, Sharif University of Technology, ghadiri@ce.sharif.edu; Vahab Mirrokni, Google Research, mirrokni@google.com; Ola Svensson, EPFL, ola.svensson@epfl.ch |
| Pseudocode | Yes | procedure round(x) // LP solution x. 1: Initialize S′ = ∅. 2: Add (i, r) to S′ with probability (1 − ε)(1 − e^(−x_ir)) (independently of the other point-radius pairs). 3: If ∃ (i, r) ≠ (j, r′) ∈ S′ such that r′ ≥ r and i ∈ B(j, r′), remove (i, r) from S′. 4: If \|S′\| > k, abort (i.e., return ∅, which has value 0); else return S, the set of first coordinates of S′. procedure generalized-round(y, x) // LP solution (y, x). 1: Initialize S = ∅. 2: Apply randomized swap rounding to the vector y/2 to obtain Y ∈ {0, 1}^V ∩ P(M). 3: For each i with Y_i = 1, add i to S and sample a radius r_i according to the probability distribution that selects r ∈ R_i with probability x_ir/y_i. 4: If i ∈ B(j, r_j) with i ≠ j ∈ S and r_j ≥ r_i, remove i from S. 5: Return S. |
| Open Source Code | No | The paper does not provide any explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We used two datasets with ground-truth clusterings. The first is COIL100, which contains images of 100 different objects [17]. [...] The second dataset is CDK2 a drug discovery dataset publicly available in Binding DB.org [15, 1]. [...] We used two handwritten text datasets. Multiple Features is a dataset of handwritten digits (649 features, 2000 instances [14]). USPS is a dataset of handwritten text (256 features, 9298 instances [12, 5]). |
| Dataset Splits | Yes | Then, we restrict data to just the chosen features, and see how well 3-NN clustering in the obtained space (which is much faster to perform than in the original space, due to the reduced number of features) compares with ground-truth clustering. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., libraries, frameworks, solvers). |
| Experiment Setup | Yes | The goal of our experiments is to evaluate the sum-min objective as well as the approximation quality of our algorithm on real datasets. For the first of the two, we consider the k-element subsets obtained by maximizing the sum-min objective (using a slight variant of our algorithm), and compare their quality (in terms of being representative of the data) with subsets obtained by maximizing the sum-sum objective, which is the most commonly used diversity objective. [...] We used Euclidean distance as the metric. [...] Tanimoto distance, which is widely used in the drug discovery literature (and is similar to Jaccard distance), was used as the metric. [...] 3-NN clustering with 10-fold cross validation. |
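The `round` procedure reported above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `round_lp`, the `dist` callback, the dict representation of the LP solution `x`, and the choice of a closed ball for B(j, r′) are all assumptions made here for concreteness.

```python
import math
import random

def round_lp(x, dist, k, eps=0.1):
    """Sketch of the paper's randomized rounding procedure round(x).

    x    : dict mapping point-radius pairs (i, r) to LP values x_ir (assumed format)
    dist : callable dist(i, j) giving the metric distance (assumed interface)
    k    : target number of diverse elements
    eps  : the epsilon in the inclusion probability (1 - eps)(1 - e^{-x_ir})
    """
    # Step 2: keep each pair (i, r) independently with the stated probability.
    kept_pairs = [(i, r) for (i, r), xir in x.items()
                  if random.random() < (1 - eps) * (1 - math.exp(-xir))]

    # Step 3: drop (i, r) if another kept pair (j, r') with r' >= r covers i,
    # i.e. i lies in the ball B(j, r') (closed ball assumed here).
    surviving = []
    for (i, r) in kept_pairs:
        covered = any((j, rp) != (i, r) and rp >= r and dist(i, j) <= rp
                      for (j, rp) in kept_pairs)
        if not covered:
            surviving.append((i, r))

    # Step 4: abort (return the empty set, which has objective value 0)
    # if more than k points survive; otherwise return the first coordinates.
    if len(surviving) > k:
        return set()
    return {i for (i, _) in surviving}
```

With large x_ir values the inclusion probability approaches (1 − ε), so on a small well-separated instance the procedure typically returns most of the points; the abort in step 4 is what makes the output a feasible k-element (or empty) set.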