Federated Heavy Hitter Recovery under Linear Sketching
Authors: Adria Gascon, Peter Kairouz, Ziteng Sun, Ananda Theertha Suresh
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also empirically evaluate our proposed algorithms and compare them with count-sketch baselines. A significant advantage of our algorithm is observed, especially when R is large. In the setting of Figure 1, to achieve an F1 score of 0.8, we see a 10x improvement in communication compared to the baseline using Count-sketch. In this section, we empirically evaluate our proposed algorithms (Algorithms 2 and 4) for the task of heavy hitter recovery and compare them with baseline methods, including (1) a Count-sketch based method and (2) an IBLT-based method without subsampling (Algorithm 2 with τ = 1). We measure communication cost in units of words (denoted as C), where each word is an int16 object (communicable in 2 bytes) in Python and C++ for implementation purposes. We mainly focus on string data with characters from ROOT, consisting of lower-case letters, digits, and special symbols in { @ # ; : . / }. Our code is available at https://github.com/google-research/federated. |
| Researcher Affiliation | Industry | 1Google Research. Authorship is in alphabetical order. Correspondence to: Ziteng Sun <zitengsun@google.com>. |
| Pseudocode | Yes | Algorithm 1: Threshold sampling. Algorithm 2: Subsampled IBLT with Lin Sketch. Algorithm 3: R-round Approx Hist with Lin Sketch. Algorithm 4: Adaptive subsampled IBLT. |
| Open Source Code | Yes | Our code is available at https://github.com/google-research/federated. |
| Open Datasets | Yes | We take the ground-truth distribution of strings in the Stackoverflow dataset in TensorFlow Federated and truncate them to the first 3 characters in set ROOT. |
| Dataset Splits | No | The paper does not provide explicit details about training, validation, and test splits (e.g., percentages, sample counts, or specific splitting methodology). While experiments are conducted, the data partitioning details are not specified. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or other computational resources used for running experiments. |
| Software Dependencies | No | The paper mentions that the implementation uses 'python and C++' and specifies the unit 'int16 object', but it does not provide specific version numbers for these languages or any libraries, frameworks, or other software dependencies. |
| Experiment Setup | Yes | We take R ∈ {10, 30, 50, 100}, τ ∈ {20, 50, 100, 200}, M ∈ {1000, 2000, 5000, 10000}, and C ∈ {100, 200, 500, 1000, 2000, 5000, 8000, 10000, 20000, 30000, 40000}. For the Count-median method, we take the max F1 score over all H ∈ {5, 7, 9, 11} for each communication cost. In our experiment, each IBLT data structure is of size 8·L0, where L0 is the targeted capacity for the IBLT. We set b = 1 in our experiments, and the estimated heavy hitters are defined as those with estimated frequency at least τ. In the adaptive algorithm (Algorithm 4), for the update rule, we use t_{r+1} = 0.5·t_r + 0.5·t_r·ŝ_r. |
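
The experiment-setup row above fixes a concrete accounting convention: communication is measured in words, one word is an int16 (2 bytes), and each IBLT of target capacity L0 occupies 8·L0 cells. A minimal Python sketch of that accounting, using only the parameter values quoted in the table (the grid variable names `R_GRID`, `TAU_GRID`, etc. are ours, not the paper's):

```python
# Communication accounting as described in the quoted experiment setup:
# costs are counted in words, where one word is an int16 (2 bytes),
# and an IBLT with target capacity L0 uses 8 * L0 cells.

WORD_BYTES = 2  # one word = one int16


def comm_cost_bytes(num_words: int) -> int:
    """Convert a communication cost C (in words) to bytes on the wire."""
    return num_words * WORD_BYTES


def iblt_size(target_capacity: int) -> int:
    """Number of cells in one IBLT, following the paper's 8 * L0 sizing rule."""
    return 8 * target_capacity


# Parameter grid reported in the experiment setup.
R_GRID = (10, 30, 50, 100)
TAU_GRID = (20, 50, 100, 200)
M_GRID = (1000, 2000, 5000, 10000)
C_GRID = (100, 200, 500, 1000, 2000, 5000, 8000,
          10000, 20000, 30000, 40000)

if __name__ == "__main__":
    # e.g. a budget of C = 10000 words corresponds to 20 kB of communication
    print(comm_cost_bytes(10_000))  # -> 20000
    print(iblt_size(500))           # an IBLT with L0 = 500 uses 4000 cells
```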
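The quoted update rule for the adaptive algorithm (Algorithm 4) is t_{r+1} = 0.5·t_r + 0.5·t_r·ŝ_r. A sketch of one update step under the plain reading of that formula; how ŝ_r is estimated each round is not specified in the quoted text, so it is treated as an input here:

```python
# A minimal sketch of the quoted threshold update for the adaptive
# subsampled IBLT (Algorithm 4): t_{r+1} = 0.5 * t_r + 0.5 * t_r * s_hat_r.
# The round estimate s_hat_r is assumed given; its computation is not
# described in the quoted excerpt.

def update_threshold(t_r: float, s_hat_r: float) -> float:
    """One step of the quoted update rule: averages t_r with t_r * s_hat_r."""
    return 0.5 * t_r + 0.5 * t_r * s_hat_r


# Usage: the threshold shrinks when s_hat_r < 1 and grows when s_hat_r > 1,
# staying fixed at s_hat_r = 1.
t = 100.0
for s_hat in (0.8, 1.2, 1.0):
    t = update_threshold(t, s_hat)
    print(f"s_hat={s_hat}: t={t:.2f}")
```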