Robust Online Correlation Clustering

Authors: Silvio Lattanzi, Benjamin Moseley, Sergei Vassilvitskii, Yuyan Wang, Rudy Zhou

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a novel online algorithm for Correlation Clustering, which is robust to noise and can be applied in real-time streaming settings. We provide theoretical guarantees for the performance of our algorithm and complement our theoretical analysis with an extensive experimental evaluation on synthetic and real-world datasets.
Researcher Affiliation | Collaboration | Silvio Lattanzi and Sergei Vassilvitskii (Google Research); Benjamin Moseley, Yuyan Wang, and Rudy Zhou (Carnegie Mellon University).
Pseudocode | Yes | Algorithm 1: ROCC(G). (An illustrative stand-in sketch appears after the table.)
Open Source Code | No | No explicit statement or link providing access to the source code for the described methodology was found.
Open Datasets | Yes | For real-world datasets, we use three popular benchmarks that are widely used in the correlation clustering literature: Blogs, Cora, and Slashdot. These datasets are publicly available from the Stanford Network Analysis Project (SNAP).
Dataset Splits | Yes | For BlogCatalog, we randomly sample 1000 nodes from the original network and form a graph with 20380 edges and 39 labels. For Cora, we consider the largest connected component of the graph, which consists of 2485 nodes, 5069 edges and 7 labels. For Slashdot, we consider the largest connected component, consisting of 77360 nodes, 458999 edges and 20 categories. For synthetic datasets, we generate graphs of varying densities and sizes, ranging from 100 to 1000 nodes. (A preprocessing sketch follows the table.)
Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, cloud instance types) were mentioned.
Software Dependencies | No | No specific software dependencies with version numbers were mentioned.
Experiment Setup | Yes | The parameters (c1, c2) for our algorithm are set to (0.5, 0.5) for the theoretical analysis and kept fixed throughout all experiments. For the synthetic datasets, we vary the number of nodes from 100 to 1000. We average the results over 50 runs. (An illustrative experiment harness follows the table.)
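
The Dataset Splits row describes component extraction and node sampling rather than train/test splits. Below is a minimal preprocessing sketch, assuming the benchmarks are loaded as undirected networkx graphs; the function names and the seed parameter are illustrative and are not taken from the paper.

# Preprocessing sketch (assumptions: undirected networkx graphs, illustrative names).
import random

import networkx as nx


def largest_connected_component(G: nx.Graph) -> nx.Graph:
    """Keep only the largest connected component, as reported for Cora and Slashdot."""
    nodes = max(nx.connected_components(G), key=len)
    return G.subgraph(nodes).copy()


def sample_induced_subgraph(G: nx.Graph, k: int, seed: int = 0) -> nx.Graph:
    """Induce a subgraph on k uniformly sampled nodes, as reported for BlogCatalog (k = 1000)."""
    rng = random.Random(seed)
    sampled = rng.sample(list(G.nodes()), k)
    return G.subgraph(sampled).copy()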
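
The paper's pseudocode (Algorithm 1: ROCC(G)) is only named in this summary, so the sketch below implements the classic randomized Pivot algorithm for correlation clustering as a hypothetical stand-in baseline, together with the standard disagreement objective ("+" edges cut across clusters plus absent edges kept inside a cluster). It is not the authors' ROCC algorithm.

# Stand-in baseline, not the paper's ROCC: randomized Pivot on a graph whose
# edges are the "+" (similar) pairs; absent edges are treated as "-" pairs.
import random

import networkx as nx


def pivot_clustering(G: nx.Graph, seed: int = 0) -> list[set]:
    """Repeatedly pick a random unclustered pivot and cluster it with its unclustered neighbors."""
    rng = random.Random(seed)
    order = list(G.nodes())
    rng.shuffle(order)
    unclustered = set(G.nodes())
    clusters = []
    for pivot in order:
        if pivot not in unclustered:
            continue
        cluster = {pivot} | (set(G.neighbors(pivot)) & unclustered)
        unclustered -= cluster
        clusters.append(cluster)
    return clusters


def disagreements(G: nx.Graph, clusters: list[set]) -> int:
    """Correlation-clustering cost: "+" edges split across clusters plus absent edges placed inside one cluster."""
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    cost = sum(1 for u, v in G.edges() if label[u] != label[v])
    for cluster in clusters:
        members = list(cluster)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if not G.has_edge(members[i], members[j]):
                    cost += 1
    return cost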
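
The Experiment Setup row reports synthetic graphs with 100 to 1000 nodes and results averaged over 50 runs. The harness below follows that protocol under clearly labeled assumptions: it reuses pivot_clustering and disagreements from the sketch above as a stand-in for ROCC, and it generates Erdos-Renyi graphs with edge probability 0.1, since the paper's synthetic generator and the role of the (c1, c2) parameters are not specified in this summary.

# Illustrative harness only; reuses pivot_clustering and disagreements from the
# previous sketch. The G(n, p) generator with p = 0.1 is an assumption.
import statistics

import networkx as nx

results = {}
for n in range(100, 1001, 100):      # node counts from 100 to 1000
    costs = []
    for run in range(50):            # 50 runs per size, then averaged
        G = nx.gnp_random_graph(n, p=0.1, seed=run)
        clusters = pivot_clustering(G, seed=run)
        costs.append(disagreements(G, clusters))
    results[n] = statistics.mean(costs)
print(results)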