Federated Heavy Hitter Recovery under Linear Sketching
Authors: Adria Gascon, Peter Kairouz, Ziteng Sun, Ananda Theertha Suresh
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also empirically evaluate our proposed algorithms and compare them with count-sketch baselines. A significant advantage of our algorithm is observed, especially when R is large. In the setting of Figure 1, to achieve an F1 score of 0.8, we see a 10x improvement in communication compared to the baseline using Count-sketch. In this section, we empirically evaluate our proposed algorithms (Algorithms 2 and 4) for the task of heavy hitter recovery and compare them with baseline methods, including (1) a Count-sketch based method and (2) an IBLT-based method without subsampling (Algorithm 2 with τ = 1). We measure communication cost in units of words (denoted as C), where each word is an int16 object (communicable in 2 bytes) in Python and C++ for implementation purposes. We mainly focus on string data with characters from ROOT, consisting of lower-case letters, digits, and special symbols in { @ # ; : . / }. Our code is available at https://github.com/google-research/federated. |
| Researcher Affiliation | Industry | 1Google Research. Authorship is in alphabetical order. Correspondence to: Ziteng Sun <zitengsun@google.com>. |
| Pseudocode | Yes | Algorithm 1: Threshold sampling. Algorithm 2: Subsampled IBLT with Lin Sketch. Algorithm 3: R-round Approx Hist with Lin Sketch. Algorithm 4: Adaptive subsampled IBLT. |
| Open Source Code | Yes | Our code is available at https://github.com/google-research/federated. |
| Open Datasets | Yes | We take the ground-truth distribution of strings in the Stackoverflow dataset in TensorFlow Federated and truncate them to the first 3 characters in set ROOT. |
| Dataset Splits | No | The paper does not provide explicit details about training, validation, and test splits (e.g., percentages, sample counts, or specific splitting methodology). While experiments are conducted, the data partitioning details are not specified. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or other computational resources used for running experiments. |
| Software Dependencies | No | The paper mentions that the implementation uses 'python and C++' and specifies the unit 'int16 object', but it does not provide specific version numbers for these languages or any libraries, frameworks, or other software dependencies. |
| Experiment Setup | Yes | We take R ∈ {10, 30, 50, 100}, τ ∈ {20, 50, 100, 200}, M ∈ {1000, 2000, 5000, 10000}, and C ∈ {100, 200, 500, 1000, 2000, 5000, 8000, 10000, 20000, 30000, 40000}. For the Count-median method, we take the max F1 score over all H ∈ {5, 7, 9, 11} for each communication cost. In our experiment, each IBLT data structure is of size 8·L0, where L0 is the targeted capacity for the IBLT. We set b = 1 in our experiments, and the estimated heavy hitters are defined as those with estimated frequency at least τ. In the adaptive algorithm (Algorithm 4), for the update rule, we use t_{r+1} = 0.5·t_r + 0.5·t_r·ŝ_r. |
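
The experiment-setup row above fixes a concrete accounting convention: communication is measured in words, one word is an int16 (2 bytes), and each IBLT of target capacity L0 occupies 8·L0 cells. A minimal Python sketch of that accounting, using only the parameter values quoted in the table (the grid variable names `R_GRID`, `TAU_GRID`, etc. are ours, not the paper's):

```python
# Communication accounting as described in the quoted experiment setup:
# costs are counted in words, where one word is an int16 (2 bytes),
# and an IBLT with target capacity L0 uses 8 * L0 cells.

WORD_BYTES = 2  # one word = one int16


def comm_cost_bytes(num_words: int) -> int:
    """Convert a communication cost C (in words) to bytes on the wire."""
    return num_words * WORD_BYTES


def iblt_size(target_capacity: int) -> int:
    """Number of cells in one IBLT, following the paper's 8 * L0 sizing rule."""
    return 8 * target_capacity


# Parameter grid reported in the experiment setup.
R_GRID = (10, 30, 50, 100)
TAU_GRID = (20, 50, 100, 200)
M_GRID = (1000, 2000, 5000, 10000)
C_GRID = (100, 200, 500, 1000, 2000, 5000, 8000,
          10000, 20000, 30000, 40000)

if __name__ == "__main__":
    # e.g. a budget of C = 10000 words corresponds to 20 kB of communication
    print(comm_cost_bytes(10_000))  # -> 20000
    print(iblt_size(500))           # an IBLT with L0 = 500 uses 4000 cells
```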
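The quoted update rule for the adaptive algorithm (Algorithm 4) is t_{r+1} = 0.5·t_r + 0.5·t_r·ŝ_r. A sketch of one update step under the plain reading of that formula; how ŝ_r is estimated each round is not specified in the quoted text, so it is treated as an input here:

```python
# A minimal sketch of the quoted threshold update for the adaptive
# subsampled IBLT (Algorithm 4): t_{r+1} = 0.5 * t_r + 0.5 * t_r * s_hat_r.
# The round estimate s_hat_r is assumed given; its computation is not
# described in the quoted excerpt.

def update_threshold(t_r: float, s_hat_r: float) -> float:
    """One step of the quoted update rule: averages t_r with t_r * s_hat_r."""
    return 0.5 * t_r + 0.5 * t_r * s_hat_r


# Usage: the threshold shrinks when s_hat_r < 1 and grows when s_hat_r > 1,
# staying fixed at s_hat_r = 1.
t = 100.0
for s_hat in (0.8, 1.2, 1.0):
    t = update_threshold(t, s_hat)
    print(f"s_hat={s_hat}: t={t:.2f}")
```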