KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals

Authors: Sandeep Silwal, Sara Ahmadian, Andrew Nystrom, Andrew McCallum, Deepak Ramachandran, Seyed Mehran Kazemi

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate gains in query minimization and clustering metrics on a variety of datasets with diverse strong and cheap oracles. We conduct extensive experiments with multiple well-studied datasets to evaluate the performance of KwikBucks over natural extensions of previous algorithms for closely-related problems.
Researcher Affiliation | Collaboration | Sandeep Silwal1, Sara Ahmadian2, Andrew Nystrom2, Andrew McCallum2, Deepak Ramachandran2, Mehran Kazemi2; 1MIT, 2Google Research
Pseudocode | Yes | Algorithm 1: KwikBucks (Our Algorithm); Algorithm 2: AssignToClusters(P, U, Q); Algorithm 3: GetPivots(t, Q); Algorithm 4: FirstNeighbor(v, N, Q); Algorithm 5: WeakFilter(v, S). An illustrative sketch of how these routines fit together appears after this table.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described.
Open Datasets | Yes | Four public datasets are comprised of text inputs: Stackoverflow (SOF) (Xu et al., 2017), Search Snippets (Phan et al., 2008), Tweet (Yin & Wang, 2016), and AG News (Rakib et al., 2020). The other four public datasets are comprised of attributed graphs: Cora (Sen et al., 2008), Amazon Photos (Shchur et al., 2018), Citeseer (Sen et al., 2008), and Microsoft Medicine (Shchur & Günnemann, 2019).
Dataset Splits | No | The paper mentions training on datasets and finetuning models, but it does not provide specific train/validation/test split percentages, sample counts, or explicit splitting methodologies needed to reproduce the data partitioning. For example, it states: "In both cases, we trained/finetuned on the training set of the datasets." and "In both cases, we sampled 10K positive pairs and 50K negative pairs and finetuned the model for 10 epochs...".
Hardware Specification | Yes | In both cases, we sampled 10K positive pairs and 50K negative pairs and finetuned the model for 10 epochs on a 4x4 Dragon Fish TPU architecture. When using 32 TPU v3 chips for the strong signal and a CPU for the geometric operations, each call to the strong signal was approximately 10^4 times slower (i.e., ζS ≈ 10^4 ζG).
Software Dependencies | No | The paper mentions software packages like "Gensim package Rehurek & Sojka (2011) for word2vec and sklearn Pedregosa et al. (2011) for tf-idf" but does not specify their version numbers.
Experiment Setup | Yes | For our experiments we fix k = 100 and perform ablation studies on this parameter. Lastly, we always reserve 10% of the query budget for performing the merge post-processing step. We also always set k = 10 when we use the Utilizing Weak Signal Neighborhood optimization of Section 3. We also always fix λ = 1/10, which appropriately normalizes the second term to be between 0 and 1. For spectral clustering, we always use k = 25 for the number of clusters.
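
The pseudocode row above lists the building blocks of the method: pivot selection (GetPivots), weak-signal pre-filtering (WeakFilter), budgeted strong-signal queries (FirstNeighbor), and cluster assignment (AssignToClusters). As a rough illustration of how a budget-constrained, pivot-based clustering loop of this kind can be organized, the sketch below uses embedding cosine similarity as the cheap weak signal and a callable strong oracle for the expensive pairwise queries. All function names, signatures, and the specific control flow here are assumptions made for illustration; the paper does not link a code repository, and this is not the authors' implementation.

```python
# Minimal sketch of a budgeted, pivot-based correlation clustering loop guided
# by a cheap weak signal. Names (weak_filter, first_neighbor, kwikbucks_sketch)
# are hypothetical and chosen only to mirror the pseudocode listed above.
import numpy as np


def weak_filter(v, candidates, embeddings, k=10):
    """Rank candidates by the cheap weak signal (dot product of normalized
    embeddings, i.e. cosine similarity) and keep the top k."""
    if not candidates:
        return []
    sims = embeddings[candidates] @ embeddings[v]
    order = np.argsort(-sims)
    return [candidates[i] for i in order[:k]]


def first_neighbor(v, ranked, strong_oracle, budget):
    """Spend strong-signal queries on ranked candidates until one is positive."""
    for u in ranked:
        if budget[0] <= 0:
            return None
        budget[0] -= 1
        if strong_oracle(v, u):
            return u
    return None


def kwikbucks_sketch(points, embeddings, strong_oracle, query_budget, k=10):
    """Budgeted pivot-based clustering: the weak signal pre-filters which
    pivots are worth an expensive strong-signal query."""
    budget = [query_budget]      # mutable counter shared with the helpers above
    pivots = []                  # points that started their own cluster
    cluster_of = {}              # point id -> cluster (pivot) id

    for v in points:
        ranked_pivots = weak_filter(v, pivots, embeddings, k)
        p = first_neighbor(v, ranked_pivots, strong_oracle, budget)
        if p is not None:
            cluster_of[v] = cluster_of[p]   # join an existing pivot's cluster
        else:
            pivots.append(v)                # v becomes a new pivot (singleton)
            cluster_of[v] = v
    return cluster_of


# Hypothetical usage with random normalized embeddings and a toy strong oracle
# standing in for an expensive cross-attention model.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(20, 8))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    oracle = lambda a, b: emb[a] @ emb[b] > 0.5
    print(kwikbucks_sketch(list(range(20)), emb, oracle, query_budget=100))
```

This sketch omits details reported in the paper, such as the merge post-processing step that consumes the reserved 10% of the query budget and the ablations over k.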