KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals
Authors: Sandeep Silwal, Sara Ahmadian, Andrew Nystrom, Andrew McCallum, Deepak Ramachandran, Seyed Mehran Kazemi
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate gains in query minimization and clustering metrics on a variety of datasets with diverse strong and cheap oracles. We conduct extensive experiments with multiple well-studied datasets to evaluate the performance of KwikBucks over natural extensions of previous algorithms for closely-related problems. |
| Researcher Affiliation | Collaboration | Sandeep Silwal¹, Sara Ahmadian², Andrew Nystrom², Andrew McCallum², Deepak Ramachandran², Mehran Kazemi² (¹MIT, ²Google Research) |
| Pseudocode | Yes | Algorithm 1: KwikBucks (Our Algorithm); Algorithm 2: AssignToClusters(P, U, Q); Algorithm 3: GetPivots(t, Q); Algorithm 4: FirstNeighbor(v, N, Q); Algorithm 5: WeakFilter(v, S). A hedged sketch of how these components fit together appears after the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | Four public datasets are comprised of text inputs: Stackoverflow (SOF) (Xu et al., 2017), SearchSnippets (Phan et al., 2008), Tweet (Yin & Wang, 2016) and AgNews (Rakib et al., 2020). The other four public datasets are comprised of attributed graphs: Cora (Sen et al., 2008), Amazon Photos (Shchur et al., 2018), Citeseer (Sen et al., 2008), and Microsoft Medicine (Shchur & Günnemann, 2019). |
| Dataset Splits | No | The paper mentions training on datasets and finetuning models, but it does not provide specific train/validation/test split percentages, sample counts, or explicit splitting methodologies needed to reproduce the data partitioning. For example, it states: "In both cases, we trained/finetuned on the training set of the datasets." and "In both cases, we sampled 10K positive pairs and 50K negative pairs and finetuned the model for 10 epochs...". |
| Hardware Specification | Yes | In both cases, we sampled 10K positive pairs and 50K negative pairs and finetuned the model for 10 epochs on a 4x4 DragonFish TPU architecture. When using 32 TPU v3 chips for the strong signal and a CPU for the geometric operations, each call to the strong signal was approximately 10^4 times slower (i.e., ζ_S ≈ 10^4 · ζ_G). |
| Software Dependencies | No | The paper mentions software packages like "Gensim package Rehurek & Sojka (2011) for word2vec and sklearn Pedregosa et al. (2011) for tf-idf" but does not specify their version numbers. |
| Experiment Setup | Yes | For our experiments we fix k = 100 and perform ablation studies on this parameter. Lastly, we always reserve 10% of the query budget for performing the merge post processing step. We also always set k = 10 when we use the Utilizing Weak Signal Neighborhood optimization of Section 3. We also always fix λ = 1/10 which appropriately normalizes the second term to be between 0 and 1. For spectral clustering, we always use k = 25 for the number of clusters. |
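To make the pseudocode row concrete, below is a minimal Python sketch of how a KwikBucks-style loop could combine the components listed above: pivot selection, a weak-signal filter that ranks candidate pivots, and strong-signal first-neighbor queries under a budget. The names `weak_sim`, `strong_oracle`, and `top_k`, and the singleton handling are illustrative assumptions, not the authors' implementation; the default `top_k = 100` only mirrors the k = 100 setting quoted in the Experiment Setup row, and the merge post-processing step (for which the paper reserves 10% of the query budget) is omitted.

```python
def kwikbucks(vertices, weak_sim, strong_oracle, budget, top_k=100):
    """Hedged sketch of a KwikBucks-style clustering loop (not the authors' code).

    weak_sim(u, v)      -> cheap similarity score (e.g., embedding dot product)
    strong_oracle(u, v) -> expensive boolean same-cluster query
    budget              -> maximum number of strong-oracle calls
    top_k               -> number of weak-signal candidates to forward per vertex
    """
    queries = 0
    clusters = {}  # pivot vertex -> list of member vertices

    for v in vertices:
        # Weak filter: rank existing pivots by the cheap signal, keep the top_k.
        ranked = sorted(clusters, key=lambda p: weak_sim(v, p), reverse=True)[:top_k]

        assigned = False
        # First neighbor: spend strong queries in weak-signal order, stopping at
        # the first pivot the strong oracle connects v to.
        for p in ranked:
            if queries >= budget:
                break
            queries += 1
            if strong_oracle(v, p):
                clusters[p].append(v)
                assigned = True
                break

        if not assigned:
            # No positive strong edge found (or budget spent): v becomes a pivot.
            clusters[v] = [v]

    return list(clusters.values()), queries
```

For instance, `weak_sim` could be cosine similarity over precomputed embeddings while `strong_oracle` wraps a finetuned cross-attention model, matching the cheap/expensive oracle split the paper describes.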