Query Complexity of Clustering with Side Information

Authors: Arya Mazumdar, Barna Saha

NeurIPS 2017

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we report experimental results on a popular bibliographic dataset cora [35] consisting of 1879 nodes, 191 clusters and 1699612 edges out of which 62891 are intra-cluster edges. We remove any singleton node from the dataset the final number of vertices that we classify is 1812 with 124 clusters. We use the similarity function computation used by [18] to compute f+ and f . The two distributions are shown in Figure 1 on the left. The Hellinger square divergence between the two distributions is 0.6. In order to observe the dependency of the algorithm performance on the learnt distributions, we perturb the exact distributions to obtain two approximate distributions as shown in Figure 1 (middle) with Hellinger square divergence being 0.4587. We consider three strategies. Suppose the cluster in which a node v must be included has already been initialized and exists in the current solution. Moreover, suppose the algorithm decides to use queries to find membership of v. Then in the best strategy, only one query is needed to identify the cluster in which v belongs. In the worst strategy, the algorithm finds the correct cluster after querying all the existing clusters whose current membership is not enough to take a decision using side information. In the greedy strategy, the algorithm queries the clusters in non-decreasing order of Hellinger square divergence between f+ (or approximate version of it) and the estimated distribution from side information between v and each existing clusters. Note that, in practice, we will follow the greedy strategy. Figure 2 shows the performance of each strategy. We plot the number of queries vs F1 Score which computes the harmonic mean of precision and recall. We observe that the performance of greedy strategy is very close to that of best. With just 1136 queries, greedy achieves 80% precision and close to 90% recall. The best strategy would need 962 queries to achieve that performance. 
The performance of our algorithm on the exact and the approximate distributions is also very close, which indicates it is enough to learn a distribution that is close to the exact one. For example, using the approximate distributions, the greedy strategy needs just 1148 queries to achieve similar precision and recall, that is, 12 queries more than when the exact distributions are known.
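The quantities driving the experiment above can be made concrete: the squared Hellinger divergence H^2(P, Q) = 1 - sum_i sqrt(p_i * q_i) between two discrete distributions, the F1 score as the harmonic mean of precision and recall, and the greedy rule of querying clusters in non-decreasing order of divergence from f+. The sketch below uses made-up per-cluster distributions (`cluster_dists`, `f_plus` are hypothetical placeholders, not values from the paper; the actual similarity computation follows [18]):

```python
import math

def hellinger_sq(p, q):
    """Squared Hellinger divergence between two discrete distributions:
    H^2(P, Q) = 1 - sum_i sqrt(p_i * q_i); 0 for identical distributions,
    1 for distributions with disjoint support."""
    return 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Sanity checks on the divergence.
assert abs(hellinger_sq([0.5, 0.5], [0.5, 0.5])) < 1e-12          # identical -> 0
assert abs(hellinger_sq([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-12    # disjoint -> 1

# Greedy strategy sketch: query existing clusters in non-decreasing order of
# divergence between f+ and the distribution estimated from side information
# between v and each cluster. Distributions here are hypothetical.
f_plus = [0.6, 0.4]
cluster_dists = {"c1": [0.2, 0.8], "c2": [0.7, 0.3], "c3": [0.5, 0.5]}
order = sorted(cluster_dists, key=lambda c: hellinger_sq(f_plus, cluster_dists[c]))
print("query order:", order)

# At 80% precision and ~90% recall, as reported for greedy with 1136 queries:
print("F1 at (0.8, 0.9):", round(f1_score(0.8, 0.9), 3))
```

The greedy order simply ranks candidate clusters by how plausibly their observed similarity values look drawn from f+, so the most likely home for v is queried first.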
Researcher Affiliation Academia Arya Mazumdar and Barna Saha College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA 01003 {arya,barna}@cs.umass.edu
Pseudocode No The paper describes an algorithm in Section 3 in prose, but does not present it as explicit pseudocode.
Open Source Code No The paper does not include any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets Yes In this section, we report experimental results on a popular bibliographic dataset cora [35] ... [35] A. McCallum, 2004. http://www.cs.umass.edu/~mcallum/data/cora-refs.tar.gz.
Dataset Splits No The paper does not explicitly specify exact training, validation, or test dataset splits or cross-validation setup for its experiments.
Hardware Specification No The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies No The paper does not specify any software dependencies with version numbers used for the experiments.
Experiment Setup No The paper describes different querying strategies (best, greedy, worst) and how distributions were perturbed for experiments, but it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or training configurations.