Interactive Bayesian Hierarchical Clustering
Authors: Sharad Vikram, Sanjoy Dasgupta
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we present a series of experiments that illustrate how a little interaction leads to significantly better hierarchical clusterings. We evaluated the convergence properties of five different querying schemes. In a simple query, a user is presented with three random data points and picks the odd one out. In a smart query, a user is unrealistically shown the entire candidate tree and reports a violated triplet. In a random query, the user is shown the induced candidate tree over a random subset of the data. In an active query, the user is shown a high-variance subtree, selected using tree-distance variance. Finally, in an interleaved query, the user is alternately shown a random subtree and a high-variance subtree. In each experiment, T was known, so user queries were simulated by picking a triplet violated by the root split of the queried tree and, if no such triplet existed, recursing on a child. Each scheme was evaluated on four different datasets. The first dataset, MNIST (LeCun et al., 1998), is a 10-way image classification dataset where the data are 28 × 28 images of digits. The target tree T is simply the K-way classification tree over the data. The second dataset is Fisher Iris, a 3-way flower classification problem, where each of 150 flowers has five features. The third dataset, Zoo (Lichman, 2013), is a set of 93 animals with 15 binary morphological features for each animal, the target tree being the induced binary tree from the Open Tree of Life (Hinchliff et al., 2015). The fourth dataset is 20 Newsgroups (Joachims, 1997), a corpus of text articles on 20 different subjects. We use the first 10 principal components as features in this classification problem. All datasets were modeled with DDTs with acquisition function a(t) = 1/(1 − t) and Brownian motion parameter σ² estimated from data. To better visualize the different convergence rates of the querying schemes, MNIST and 20 Newsgroups were subsampled to 150 random points. For each dataset and querying scheme, we instantiated an SPR sampler with no constraints. Every one hundred iterations of the sampler, we performed a query. In subtree queries, we used subsets of size \|S\| = 10, and in active querying, the highest-variance subset was chosen from L = 20 different random subsets. As baselines, we measured the triplet distance of the vanilla DDT and the average-linkage tree. Finally, results were averaged over four runs of each sampler. The triplet distances for Fisher Iris and MNIST can be seen in Figure 6. Results for the other datasets can be found in the supplement. Although unrealistic due to the size of the tree shown to the user, the smart query performed best, achieving minimum error with the fewest queries. Interleaved querying came next, followed by active, random, and simple. In general, the vanilla DDT performed the worst; the average-linkage score varied by dataset, but in all cases the subtree querying schemes outperformed both the vanilla DDT and average linkage. In three datasets (MNIST, Fisher Iris, and Zoo), interactive methods achieve higher data likelihood than the vanilla DDT. Initially, the sampler is often restructuring the tree to satisfy new triplets, and the data likelihood is unlikely to rise. However, over time, as fewer triplets are reported, the data likelihood increases rapidly. We thus conjecture that triplet constraints may help the MH algorithm find better optima. |
| Researcher Affiliation | Academia | Sharad Vikram SVIKRAM@CS.UCSD.EDU Sanjoy Dasgupta DASGUPTA@CS.UCSD.EDU Computer Science and Engineering, UCSD, 9500 Gilman Drive, La Jolla, CA 92093 |
| Pseudocode | No | The paper describes algorithmic steps, such as those for the BUILD algorithm and SPR move, but does not present them in a structured pseudocode block or a clearly labeled 'Algorithm' section. |
| Open Source Code | No | The paper does not contain any statement about making its source code available, nor does it provide any links to a code repository. |
| Open Datasets | Yes | Each scheme was evaluated on four different datasets. The first dataset, MNIST (Lecun et al., 1998)... The second dataset is Fisher Iris... The third dataset, Zoo (Lichman, 2013)... The fourth dataset is 20 Newsgroups (Joachims, 1997). |
| Dataset Splits | No | The paper mentions subsampling certain datasets ('MNIST and 20 Newsgroups were subsampled to 150 random points') but does not provide explicit details on train, validation, or test splits, such as percentages or sample counts. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming languages, libraries, or specific frameworks. |
| Experiment Setup | Yes | All datasets were modeled with DDTs with acquisition function a(t) = 1/(1 − t) and Brownian motion parameter σ² estimated from data. Every one hundred iterations of the sampler, we performed a query. In subtree queries, we used subsets of size \|S\| = 10, and in active querying, the highest-variance subset was chosen from L = 20 different random subsets. |
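Since the paper releases no code, the simulated-user protocol quoted above (pick a triplet violated by the root split of the queried tree; if none exists, recurse on a child) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: trees are encoded here as nested tuples of leaf labels, and the function names (`leaves`, `target_triplet`, `simulated_oracle`) and the left-child-first recursion order are our assumptions.

```python
from itertools import combinations

def leaves(t):
    """Leaf labels of a binary tree encoded as nested tuples."""
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])

def target_triplet(t, a, b, c):
    """Descend t until {a, b, c} first splits; return the pair kept together.
    The remaining leaf is the 'odd one out' for this triplet."""
    while True:
        left = leaves(t[0])
        in_left = [x for x in (a, b, c) if x in left]
        if len(in_left) == 3:
            t = t[0]
        elif not in_left:
            t = t[1]
        else:
            pair = in_left if len(in_left) == 2 else [x for x in (a, b, c) if x not in in_left]
            return tuple(sorted(pair))

def simulated_oracle(candidate, target):
    """Report a triplet violated by the candidate's root split; if none
    exists at this split, recurse on a child (left child first here,
    an arbitrary tie-break -- the paper just says 'recursing on a child')."""
    while isinstance(candidate, tuple):
        left = leaves(candidate[0])
        for a, b, c in combinations(sorted(leaves(candidate)), 3):
            x, y = target_triplet(target, a, b, c)
            odd = ({a, b, c} - {x, y}).pop()
            # Violated at this split: the target's pair is separated here.
            if (x in left) != (y in left):
                return (x, y), odd
        candidate = candidate[0] if isinstance(candidate[0], tuple) else candidate[1]
    return None  # candidate already satisfies every target triplet
```

For example, with target `((('a','b'),'c'),'d')` and candidate `((('a','c'),'b'),'d')`, the root split of the candidate violates no triplet, so the oracle recurses into `(('a','c'),'b')` and reports the violated triplet `(('a','b'), 'c')`.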
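The evaluation above reports "triplet distance" between a sampled tree and the target: the fraction of leaf triplets the two trees resolve differently. A minimal sketch of this standard metric (not the authors' code; nested-tuple trees and an O(n³)-triplet brute-force loop are assumptions made here for clarity):

```python
from itertools import combinations

def _leaves(t):
    return {t} if not isinstance(t, tuple) else _leaves(t[0]) | _leaves(t[1])

def _resolve(t, a, b, c):
    """Pair of {a, b, c} that tree t keeps together below their first split."""
    while True:
        left = _leaves(t[0])
        in_left = [x for x in (a, b, c) if x in left]
        if len(in_left) == 3:
            t = t[0]
        elif not in_left:
            t = t[1]
        else:
            pair = in_left if len(in_left) == 2 else [x for x in (a, b, c) if x not in in_left]
            return tuple(sorted(pair))

def triplet_distance(t1, t2):
    """Fraction of leaf triplets the two trees resolve differently."""
    trips = list(combinations(sorted(_leaves(t1)), 3))
    return sum(_resolve(t1, *tr) != _resolve(t2, *tr) for tr in trips) / len(trips)
```

For instance, `((('a','b'),'c'),'d')` and `((('a','c'),'b'),'d')` disagree only on the triplet {a, b, c}, giving a distance of 1/4.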
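The active-querying scheme picks, from L = 20 random leaf subsets of size \|S\| = 10, the one whose induced subtree is most uncertain under the posterior. The sketch below is one plausible reading of "tree-distance variance", not the paper's definition: it scores each candidate subset by the summed pairwise triplet disagreement of its induced subtrees across posterior samples, a dispersion proxy chosen here for illustration. `induced` and `active_query_subset` are hypothetical names.

```python
import random
from itertools import combinations

def _leaves(t):
    return {t} if not isinstance(t, tuple) else _leaves(t[0]) | _leaves(t[1])

def _resolve(t, a, b, c):
    """Pair of {a, b, c} that tree t keeps together below their first split."""
    while True:
        left = _leaves(t[0])
        in_left = [x for x in (a, b, c) if x in left]
        if len(in_left) == 3:
            t = t[0]
        elif not in_left:
            t = t[1]
        else:
            pair = in_left if len(in_left) == 2 else [x for x in (a, b, c) if x not in in_left]
            return tuple(sorted(pair))

def induced(t, S):
    """Subtree of t induced by the leaf subset S (None if S misses t)."""
    if not isinstance(t, tuple):
        return t if t in S else None
    l, r = induced(t[0], S), induced(t[1], S)
    return (l, r) if l is not None and r is not None else (l if l is not None else r)

def active_query_subset(posterior_trees, subset_size=10, num_candidates=20, rng=random):
    """Among num_candidates random leaf subsets (|S| = 10, L = 20 in the
    paper), return the one whose induced subtrees disagree most across
    posterior samples -- summed pairwise triplet disagreement stands in
    for the paper's tree-distance variance here."""
    labels = sorted(_leaves(posterior_trees[0]))
    best_S, best_score = None, -1.0
    for _ in range(num_candidates):
        S = set(rng.sample(labels, subset_size))
        subs = [induced(t, S) for t in posterior_trees]
        trips = list(combinations(sorted(S), 3))
        score = sum(sum(_resolve(u, *tr) != _resolve(v, *tr) for tr in trips)
                    for u, v in combinations(subs, 2))
        if score > best_score:
            best_S, best_score = S, score
    return best_S
```

The chosen subset is then shown to the user as an induced subtree, matching the paper's subtree-query setup.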