Submodular Hypergraphs: p-Laplacians, Cheeger Inequalities and Spectral Clustering
Authors: Pan Li, Olgica Milenkovic
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In what follows, we compare the algorithms for submodular hypergraph clustering described in the previous section to two methods: The IPM for homogeneous hypergraph clustering (Hein et al., 2013) and the clique expansion method (CEM) for submodular hypergraph clustering (Li & Milenkovic, 2017). We focus on 2-way graph partitioning problems related to the University of California Irvine (UCI) datasets selected for analysis in (Hein et al., 2013), described in Table 1 of the Supplementary Material. The datasets include 20Newsgroups, Mushrooms, Covertype. In all datasets, ζ(E) was roughly 103, and each of these datasets describes multiple clusters. Since we are interested in 2-way partitioning, we focused on two pairs of clusters in Covertype, denoted by (4, 5) and (6, 7), and paired the four 20Newsgroups clusters, one of which includes Comp. and Sci, and another one which includes Rec. and Talk. The Mushrooms and 20Newsgroups datasets contain only categorical features, while Covertype also includes numerical features. We adopt the same approach as the one described in (Hein et al., 2013) to construct hyperedges: Each feature corresponds to one hyperedge; hence, each categorical feature is captured by one hyperedge, while numerical features are first quantized into 10 bins of equal size, and then mapped to hyperedges. To describe the submodular weights, we fix ϑe = 1 for all hyperedges and parametrize we using a variable α (0, 0.5] we(S; α) = 1 2 min 1, |S| α|e| , |e/S| The intuitive explanation behind our choice of weights is that it allows one to accommodate categorization errors and outliers: In contrast to the homogeneous case in which any partition of a hyperedge has weight one, the chosen submodular weights allow a smaller weight to be used when the hyperedge is partitioned into small parts, i.e., when min{|S|, |e/S|} < α|e| . In practice, α is chosen to be relatively small in all experiments, we set α 0.04, with α close to zero producing homogeneous hyperedge weights. The results are shown in Figure 1. As may be observed, both in terms of the Clustering error (i.e., the total number of erroneously classified vertices) and the values of the Cheeger constant, IPM-based methods outperform CEM. This is due to the fact that for large hyperedge sizes, CEM incurs a high distortion when approximating the submodular weights (O(ζ(E)) (Li & Milenkovic, 2017)). Moreover, as we(S) depends merely on |S|, the submodular hypergraph CEM reduces to the homogeneous hypergraph CEM (Zhou et al., 2007), which is an issue that the IPM-based method does not face. Comparing the performance of IPM on submodular hypergraphs (IPM-S) with that on homogeneous hypergraphs (IPM-H), we see that IPM-S achieves better clustering performance on both 20Newsgroups and Covertypes, and offers the same performance as IPM-H on the Mushrooms dataset. This indicates that it is practically useful to use submodular hyperedge weights for clustering purposes. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, USA. |
| Pseudocode | Yes | Algorithm 1 lists the steps of an SDP-based algorithm for minimizing R2(x), and it comes with approximation guarantees stated in Lemma 4.4. [...] Algorithm 2: IPM-based minimization of R1(x) |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We focus on 2-way graph partitioning problems related to the University of California Irvine (UCI) datasets selected for analysis in (Hein et al., 2013), described in Table 1 of the Supplementary Material. The datasets include 20Newsgroups, Mushrooms, Covertype. [...] Asuncion, Arthur and Newman, David. UCI machine learning repository, 2007. |
| Dataset Splits | No | The paper describes the datasets used and how subsets of data were selected for 2-way partitioning (e.g., 'focused on two pairs of clusters in Covertype'), but it does not specify explicit training, validation, or test dataset splits with percentages, sample counts, or cross-validation details for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or specific computing environments) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | To describe the submodular weights, we fix ϑe = 1 for all hyperedges and parametrize we using a variable α (0, 0.5] we(S; α) = 1 2 min 1, |S| α|e| , |e/S| [...] In practice, α is chosen to be relatively small in all experiments, we set α 0.04, with α close to zero producing homogeneous hyperedge weights. |