Differentiable Clustering with Perturbed Spanning Forests

Authors: Lawrence Stewart, Francis Bach, Felipe Llinares-López, Quentin Berthet

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our framework for learning through clustering in both a supervised and a semi-supervised setting, as illustrated in Figure 1. Formally, for large training datasets of size n, we either have access to a full cluster connectivity matrix M_Ω or a partial one (typically built by using partial label information, see below). We use this clustering information M_Ω, from which mini-batches can be extracted, as supervision. We minimize our partial Fenchel-Young loss with respect to the weights of an embedding model, and evaluate these embeddings in two main ways on a test dataset: first, by evaluating the clustering accuracy (i.e. the proportion of correct coefficients in the predicted cluster connectivity matrix), and second, by training a shallow model on a classification task (using clusters as classes) on a holdout set and evaluating it on a test set. (A sketch of the connectivity matrix and precision computation appears below the table.)
Researcher Affiliation | Collaboration | Lawrence Stewart (ENS & INRIA, Paris, France); Francis Bach (ENS & INRIA, Paris, France); Felipe Llinares-López (Google DeepMind, Paris, France); Quentin Berthet (Google DeepMind, Paris, France)
Pseudocode | No | The paper describes algorithms (such as Kruskal's) in text but does not present them in structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code base: https://github.com/LawrenceMMStewart/DiffClust_NeurIPS2023
Open Datasets | Yes | We train a CNN (LeNet-5, LeCun et al. (1998)) on mini-batches of size 64 using the partial Fenchel-Young loss to learn a clustering, with a batch-wise clustering precision of 0.99 for MNIST and 0.96 on Fashion-MNIST. Using the same approach, we trained a ResNet (He et al., 2016) on CIFAR-10 (with some minor modifications to kernel size and stride), achieving a batch-wise clustering test precision of 0.93. (A sketch of a LeNet-5-style embedding model appears below the table.)
Dataset Splits | Yes | We also create a validation set of equal size (in exactly the same manner as the train set), to ensure the model has not over-fitted to the train set.
Hardware Specification | Yes | For this experiment and all experiments described below, we trained on a single Nvidia V100 GPU; training the CNN with our proposed pipeline took < 15 minutes.
Software Dependencies | Yes | Our implementation of Kruskal's is tailored to our use: we first initialize both A_k(Σ) and M_k(Σ) as the identity matrix, and then sort the upper-triangular entries of Σ. We build the maximum-weight spanning forest in the usual greedy manner, using A_k(Σ) to keep track of edges in the forest and M_k(Σ) to check if an edge can be added without creating a cycle, updating both matrices at each step of the algorithm. Once the forest has k connected components, the algorithm terminates. This is done by keeping track of the number of edges that have been added at any time. We remark that our implementation takes the form of a single loop, with each step of the loop consisting only of matrix multiplications. For this reason it is fully compatible with auto-differentiation engines, such as JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2016), and suitable for GPU/TPU acceleration. (A sketch of this bookkeeping appears below the table.)
Experiment Setup | Yes | The model was trained for 30k gradient steps on mini-batches of size 64. We used the Adam optimizer (Kingma and Ba, 2015) with learning rate η = 3 × 10^-4, momentum parameters (β1, β2) = (0.9, 0.999), and an ℓ2 weight decay of 10^-4. We validated / tested the model using the zero-one error between the true equivalence matrix and the equivalence matrix corresponding to the output of the model. (A sketch of this optimizer configuration and metric appears below the table.)
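
For the "Research Type" row above: a minimal sketch, assuming clusters are given by (possibly partial) integer labels, of how a cluster connectivity matrix M_Ω can be built and how clustering precision (the proportion of correct coefficients) can be computed. The function names and the masking convention are illustrative, not taken from the paper's code base.

```python
import numpy as np

def connectivity_from_labels(labels):
    """Cluster connectivity (equivalence) matrix: M[i, j] = 1 iff i and j share a label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(np.float32)

def clustering_precision(m_pred, m_true, mask=None):
    """Proportion of correct coefficients in the predicted connectivity matrix.

    `mask` (optional) restricts the comparison to observed pairs Ω in the
    semi-supervised setting; the paper's exact masking convention may differ.
    """
    correct = (m_pred == m_true).astype(np.float32)
    if mask is None:
        return float(correct.mean())
    return float((correct * mask).sum() / mask.sum())

# Example: ground-truth connectivity for a mini-batch of 4 points in 2 clusters.
m_true = connectivity_from_labels([0, 0, 1, 1])
```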
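
For the "Open Datasets" row: a rough Flax sketch of a LeNet-5-style embedding model of the kind described. The paper states that LeNet-5 was used; the embedding dimension, activations, and pooling choices below are assumptions, and Flax is only one of the autodiff frameworks the paper mentions.

```python
import flax.linen as nn

class LeNet5Embed(nn.Module):
    """LeNet-5-style CNN producing embeddings whose pairwise similarities feed the clustering layer."""
    embed_dim: int = 84  # assumed embedding size

    @nn.compact
    def __call__(self, x):  # x: (batch, 28, 28, 1) for MNIST / Fashion-MNIST
        x = nn.relu(nn.Conv(features=6, kernel_size=(5, 5))(x))
        x = nn.avg_pool(x, window_shape=(2, 2), strides=(2, 2))
        x = nn.relu(nn.Conv(features=16, kernel_size=(5, 5))(x))
        x = nn.avg_pool(x, window_shape=(2, 2), strides=(2, 2))
        x = x.reshape((x.shape[0], -1))      # flatten spatial dimensions
        x = nn.relu(nn.Dense(120)(x))
        return nn.Dense(self.embed_dim)(x)   # embedding used to build Σ
```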
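
For the "Software Dependencies" row: a minimal NumPy sketch of the A_k(Σ) / M_k(Σ) bookkeeping described there. The paper's actual implementation restructures this into a single loop of matrix multiplications so that it is autodiff- and accelerator-friendly; this sketch only illustrates the greedy maximum-weight spanning-forest construction and the cycle check via the connectivity matrix.

```python
import numpy as np

def kruskal_forest(sigma, k):
    """Greedy maximum-weight spanning forest with k connected components.

    sigma: (n, n) symmetric similarity matrix.
    Returns (A, M): the forest adjacency matrix A_k(sigma) and the cluster
    connectivity (equivalence) matrix M_k(sigma), both initialized to the identity.
    """
    n = sigma.shape[0]
    A = np.eye(n)                      # edges of the forest (plus diagonal)
    M = np.eye(n)                      # which nodes share a connected component
    iu, ju = np.triu_indices(n, k=1)   # sort the upper-triangular entries of sigma
    order = np.argsort(-np.asarray(sigma, dtype=float)[iu, ju])
    added, needed = 0, n - k           # each accepted edge removes one component
    for e in order:
        i, j = iu[e], ju[e]
        if M[i, j] == 0:               # adding (i, j) creates no cycle
            A[i, j] = A[j, i] = 1.0
            mi, mj = M[:, i], M[:, j]
            # Merge the two components: every node connected to i becomes
            # connected to every node connected to j, and vice versa.
            M = np.maximum(M, np.maximum(np.outer(mi, mj), np.outer(mj, mi)))
            added += 1
            if added == needed:        # forest now has k connected components
                break
    return A, M
```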
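
For the "Experiment Setup" row: a hedged Optax-style rendering of the reported hyperparameters, plus the zero-one error between equivalence matrices. The paper does not say which optimizer library was used, and whether the ℓ2 weight decay is coupled (Adam plus an ℓ2 penalty) or decoupled (AdamW) is an assumption here.

```python
import optax

# Reported hyperparameters: η = 3 × 10^-4, (β1, β2) = (0.9, 0.999), weight decay 10^-4.
# optax.adamw applies decoupled weight decay; adding a plain ℓ2 penalty to the
# loss would be an equally plausible reading of the paper.
optimizer = optax.adamw(learning_rate=3e-4, b1=0.9, b2=0.999, weight_decay=1e-4)

def zero_one_error(m_pred, m_true):
    """Zero-one error between the predicted and true equivalence matrices."""
    return float((m_pred != m_true).mean())
```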