Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Triangle and Four Cycle Counting with Predictions in Graph Streams

Authors: Justin Y. Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David Woodruff, Michael Zhang

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms." (Section 4, Experiments)
Researcher Affiliation | Collaboration | Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, and Sandeep Silwal (Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA); Honghao Lin, David P. Woodruff, Michael Zhang (Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA); Tal Wagner (Microsoft Research, Redmond, WA 98052, USA); Talya Eden (Massachusetts Institute of Technology and Boston University, Cambridge, MA 02139, USA)
Pseudocode | Yes | "Algorithm 1: Counting Triangles in the Adjacency List Model"
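The paper's Algorithm 1 is not reproduced in this report. As a point of reference for what the streaming algorithms estimate, an exact triangle counter over an undirected edge list is a few lines; this is an illustrative baseline of our own, not the authors' algorithm:

```python
def count_triangles(edges):
    """Exact triangle count of a simple undirected graph given as an
    edge list. Every common neighbor of an edge's endpoints closes a
    triangle; each triangle is seen once per edge, hence the final
    division by 3. (Reference check only, not the paper's Algorithm 1.)"""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3
```

An exact counter like this is how streaming estimates are typically validated on small graphs, where the full adjacency structure fits in memory.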
Open Source Code | No | The paper mentions using code from other authors for baselines (e.g., "We use the authors code for our experiments (Shin et al., 2020; Shin, 2020).") and for specific components (e.g., "Here we use the method and code proposed in Zhang & Chen (2018)."). However, it does not explicitly state that the source code for the methodology described in this paper is released, nor does it provide a link to it.
Open Datasets | Yes | The Oregon and CAIDA datasets come from Leskovec & Krevl (2014) and Leskovec et al. (2005); the Wikibooks dataset from Rossi & Ahmed (2015); the Reddit dataset from Leskovec & Krevl (2014) and Kumar et al. (2018); the Twitch dataset from Rozemberczki et al. (2019); the Wikipedia dataset from Rossi & Ahmed (2015); and the Powerlaw graphs are sampled from the Chung-Lu-Vu random graph model with the expected degree of the i-th vertex proportional to 1/i^2 (Chung et al., 2003).
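The Powerlaw graphs above follow the Chung-Lu model, where each vertex i gets a weight w_i and edge (i, j) appears independently with probability min(1, w_i·w_j / W), W being the total weight. A minimal sketch with weights proportional to 1/i^2, matching the quoted setup (the function name and parameters are illustrative, not from the paper):

```python
import random

def chung_lu_graph(n, seed=0):
    """Sample a simple undirected Chung-Lu random graph on n vertices
    with vertex weights w_i proportional to 1/i^2 (i = 1..n), so that
    expected degrees follow the same power law. Illustrative sketch."""
    rng = random.Random(seed)
    w = [1.0 / i**2 for i in range(1, n + 1)]
    W = sum(w)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Each potential edge is included independently.
            if rng.random() < min(1.0, w[i] * w[j] / W):
                edges.append((i, j))
    return edges
```

The quadratic loop is fine for the sketch; large instances would use the standard near-linear Chung-Lu samplers instead.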
Dataset Splits | Yes | "In the Wiki Books temporal graph, we use the exact Ne counts on the first half of the graph edges (when sorted by their timestamps) as the predicted values for the second half." "To produce an edge embedding f(e) for an edge e = uv from the node embeddings of its endpoints, f(u) and f(v)... Training is done on a prefix of the first half of the edges." "For each of the two networks, we start with a graph that has twice as many edges as listed in Table 2... We then randomly remove 50% of the total edges to be the training data set, and use the remaining edges as the graph we test on."
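The two split strategies quoted above (a temporal half split and a random 50% edge removal) can be sketched as follows; the function names are ours, not the authors':

```python
import random

def temporal_split(edges):
    """Sort (u, v, t) edges by timestamp and cut in half: the first
    half supplies the exact counts used as predictions for the second."""
    ordered = sorted(edges, key=lambda e: e[2])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def random_split(edges, seed=0):
    """Randomly remove 50% of the edges as the training set; the
    remaining edges form the graph the algorithm is tested on."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]  # (train, test)
```

Note the temporal split leaks no future information into the predictions, while the random split models a setting where a related sample of the same network is available for training.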
Hardware Specification | Yes | "All of our graph experiments were done on a CPU with i5 2.7 GHz dual core and 8 GB RAM or a CPU with i7 1.9 GHz 8 core and 16 GB RAM."
Software Dependencies | No | The paper mentions using "the authors code" for the baselines (ThinkD and WRS) and refers to the GNN method of Zhang & Chen (2018), implying that third-party software was used. However, it does not list specific software components with version numbers (e.g., "Python 3.8, PyTorch 1.9, and CUDA 11.1") needed to replicate the experiments.
Experiment Setup | Yes | "To remedy this discrepancy, we modify our algorithm slightly by setting a fixed fraction of space to use for heavy edges (10% of space for all of our experiments) and setting p correspondingly to use up the rest of the space bound given as input." "The constant 0.3 is fixed throughout all our experiments. We then set p = 0.7 Z/m." "For the multi-layer version, we use 10% of the total space for storing the top k = Z/10 edges, and 70% of the space for sub-sampling the edges whose oracle-predicted value is very tiny... Then, we use 20% of the space for sub-sampling the remaining edges..."
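The space split described in that row can be made concrete with a small helper; the function and key names are illustrative, and only the fractions and p = 0.7·Z/m come from the quoted text:

```python
def space_budget(Z, m, multi_layer=False):
    """Allocate a total space bound Z (in stored edges) for a stream
    of m edges, per the fractions quoted in the experiment setup.
    Single-layer: 10% reserved for predicted-heavy edges, with
    sub-sampling rate p = 0.7 * Z / m spending most of the rest.
    Multi-layer: 10% for the top k = Z/10 edges, 70% for edges with
    tiny predicted values, 20% for the remaining edges."""
    if multi_layer:
        return {"top_k": Z // 10,
                "tiny_pred": int(0.7 * Z),
                "rest": int(0.2 * Z)}
    p = min(1.0, 0.7 * Z / m)
    return {"heavy": Z // 10, "sampling_rate": p}
```

For example, with a budget of Z = 1000 edges on a stream of m = 100000 edges, the single-layer split reserves 100 slots for heavy edges and samples the rest at rate 0.007.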