Triangle and Four Cycle Counting with Predictions in Graph Streams

Authors: Justin Y Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David Woodruff, Michael Zhang

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms." (Section 4, EXPERIMENTS)
Researcher Affiliation | Collaboration | Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, and Sandeep Silwal (Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; {justc, indyk, shyamsn, ronitt, silwal}@mit.edu); Honghao Lin, David P. Woodruff, and Michael Zhang (Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA; {honghaol, dwoodruf, jinyaoz}@andrew.cmu.edu); Tal Wagner (Microsoft Research, Redmond, WA 98052, USA; tal.wagner@gmail.com); Talya Eden (Massachusetts Institute of Technology and Boston University, Cambridge, MA 02139, USA; teden@mit.edu)
Pseudocode | Yes | Algorithm 1, "Counting Triangles in the Adjacency List Model"
Open Source Code | No | The paper mentions using code from other authors for baselines (e.g., "We use the authors code for our experiments (Shin et al., 2020; Shin, 2020).") and for specific components (e.g., "Here we use the method and code proposed in Zhang & Chen (2018)."). However, it does not state that the source code for the paper's own method is released, nor does it provide a link to it.
Open Datasets | Yes | The Oregon and CAIDA datasets come from Leskovec & Krevl (2014) and Leskovec et al. (2005); the Wikibooks dataset from Rossi & Ahmed (2015); the Reddit dataset from Leskovec & Krevl (2014) and Kumar et al. (2018); the Twitch dataset from Rozemberczki et al. (2019); the Wikipedia dataset from Rossi & Ahmed (2015); and the Powerlaw graphs are sampled from the Chung-Lu-Vu random graph model with the expected degree of the i-th vertex proportional to 1/i^2 (Chung et al., 2003).
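The Powerlaw construction described above can be sketched in a few lines. Below is a minimal Chung-Lu-style sampler assuming the standard edge probability min(1, w_i * w_j / W) with weights w_i proportional to 1/i^2; the function name and parameters are illustrative, not taken from the paper's code:

```python
import random

def chung_lu_sample(n, seed=0):
    """Sample an undirected Chung-Lu graph on n vertices.

    Weights w_i are proportional to 1/i^2 (1-indexed), matching the
    Powerlaw graphs described above. Illustrative sketch only.
    """
    rng = random.Random(seed)
    w = [1.0 / (i + 1) ** 2 for i in range(n)]  # w_i ∝ 1/i^2
    total = sum(w)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Include edge (i, j) with probability min(1, w_i * w_j / W).
            p = min(1.0, w[i] * w[j] / total)
            if rng.random() < p:
                edges.append((i, j))
    return edges
```

The quadratic pair loop is fine for small test graphs; real generators for large n use faster sampling, but the edge probabilities are the same.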
Dataset Splits | Yes | In the Wiki Books temporal graph, the exact Ne counts on the first half of the graph edges (when sorted by their timestamps) are used as the predicted values for the second half. An edge embedding f(e) for an edge e = uv is produced from the node embeddings of its endpoints, f(u) and f(v)... Training is done on a prefix of the first half of the edges. For each of the two networks, the authors start with a graph that has twice as many edges as listed in Table 2... They then randomly remove 50% of the total edges to form the training data set, and use the remaining edges as the graph tested on.
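The first-half/second-half temporal split is simple to reproduce in spirit. A sketch under the assumption that each edge carries a sortable timestamp (the helper name and signature are hypothetical, not the authors' code):

```python
def temporal_split(timestamped_edges):
    """Split a temporal edge list in half by timestamp.

    Mirrors the described setup: statistics computed exactly on the
    first half (by timestamp) serve as predictions for the second half.
    """
    ordered = [edge for _, edge in sorted(timestamped_edges)]
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]
```

For the non-temporal networks the split is random instead: half the edges are removed for training and the rest form the test graph.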
Hardware Specification | Yes | All graph experiments were run on a CPU with a 2.7 GHz dual-core i5 and 8 GB RAM, or a CPU with a 1.9 GHz 8-core i7 and 16 GB RAM.
Software Dependencies | No | The paper mentions using "the authors code" for baselines (ThinkD and WRS) and refers to the GNN method of Zhang & Chen (2018), implying software was used. However, it does not list specific software components with version numbers (e.g., "Python 3.8, PyTorch 1.9, and CUDA 11.1") needed to replicate the experiments.
Experiment Setup | Yes | "To remedy this discrepancy, we modify our algorithm slightly by setting a fixed fraction of space to use for heavy edges (10% of space for all of our experiments) and setting p correspondingly to use up the rest of the space bound given as input." The constant 0.3 is fixed throughout all experiments, and p is then set to p = 0.7Z/m. For the multi-layer version, 10% of the total space is used for storing the top k = Z/10 edges, and 70% of the space for sub-sampling the edges whose oracle-predicted value is very tiny... Then, 20% of the space is used for sub-sampling the remaining edges...
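The quoted space fractions can be made concrete with a small arithmetic sketch. Assuming Z denotes the total space budget, one reading of the multi-layer allocation described above is (function and variable names are illustrative, not from the authors' code):

```python
def multilayer_budget(Z):
    """Illustrative split of a space budget Z for the multi-layer
    version, using the fractions quoted above (assumed reading)."""
    k_top = Z // 10        # top-k heavy edges stored exactly (k = Z/10)
    tiny_space = 0.7 * Z   # sub-sampling edges with tiny predicted counts
    rest_space = 0.2 * Z   # sub-sampling the remaining edges
    return k_top, tiny_space, rest_space
```

In the single-layer variant the same idea applies with two pieces: 10% of Z for heavy edges, and a sampling rate p = 0.7Z/m spending the rest over the m stream edges.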