Learning-Based Frequency Estimation Algorithms

Authors: Chen-Yu Hsu, Piotr Indyk, Dina Katabi, Ali Vakilian

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also evaluate our algorithms on two real-world datasets and demonstrate empirically their performance gains.
Researcher Affiliation | Academia | Chen-Yu Hsu, Piotr Indyk, Dina Katabi & Ali Vakilian, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. {cyhsu,indyk,dk,vakilian}@mit.edu
Pseudocode | Yes | Algorithm 1 provides pseudo code for our design. The design assumes an oracle HH(i) that attempts to determine whether an item i is a heavy hitter or not. (A hedged sketch of this design appears after the table.)
Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of its own source code.
Open Datasets | Yes | CAIDA. Caida internet traces 2016 chicago. http://www.caida.org/data/monitors/passive-equinix-chicago.xml. Dataset: The traffic data is collected at a backbone link of a Tier 1 ISP between Chicago and Seattle in 2016 (CAIDA). Dataset: We use the AOL query log dataset, which consists of 21 million search queries collected from 650 thousand users over 90 days. The users are anonymized in the dataset.
Dataset Splits | Yes | For a recording session, we use the first 7 minutes for training, the following minute for validation, and estimate the packet counts in subsequent minutes. We use the first 5 days for training, the following day for validation, and estimate the number of times different search queries appear in subsequent days. (See the split helper sketched below.)
Hardware Specification | No | The inference time takes 2.8 microseconds per item on a single GPU without any optimizations. (Footnote 4: Note that new specialized hardware such as Google TPU, hardware accelerators and network compression (Han et al., 2017; Sze et al., 2017; Chen et al., 2017; Han et al., 2016; 2015) can drastically improve the inference time. Further, Nvidia has predicted that GPUs will get 1000x faster by 2025.) The specific model of the GPU is not mentioned.
Software Dependencies | No | The paper mentions using "neural networks", "RNNs", "fully-connected networks", and "LSTM cells", but it does not specify any software names with version numbers (e.g., TensorFlow 2.x, PyTorch 1.x).
Experiment Setup | Yes | Model: We trained a neural network to predict the log of the packet counts for each flow. ... We use two RNNs to encode the source and destination IP addresses separately. ... Additionally, we use two-layer fully-connected networks to encode the source and destination ports. ... (Footnote 3) We use RNNs with 64 hidden units. The two-layer fully-connected networks for the ports have 16 and 8 hidden units. The final layer before the prediction has 32 hidden units. To construct the heavy hitter oracle, we trained a neural network to predict the number of times a search phrase appears. To process the search phrase, we train an RNN with LSTM cells that takes characters of a search phrase as input. ... (Footnote 6) We use an embedding size of 64 dimensions, an RNN with 256 hidden units, and a fully-connected layer with 32 hidden units. (A rough model sketch follows the table.)
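The Pseudocode row refers to a design built around a heavy-hitter oracle HH(i). The snippet below is a minimal Python sketch of how such an oracle can be combined with a standard Count-Min sketch: items the oracle flags as heavy get dedicated exact counters, everything else is hashed into the sketch. The bucket sizes, hash functions, and the stand-in oracle in the usage example are illustrative assumptions, not the paper's Algorithm 1 verbatim.

```python
import random
from collections import defaultdict

class LearnedCountMin:
    """Minimal sketch: a heavy-hitter oracle routes predicted heavy items to
    dedicated exact counters; all other items go into a standard Count-Min sketch."""

    def __init__(self, hh_oracle, width=1024, depth=4, seed=0):
        self.hh_oracle = hh_oracle                     # callable: item -> bool (predicted heavy hitter?)
        self.width, self.depth = width, depth          # illustrative sketch dimensions
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]
        self.exact = defaultdict(int)                  # unique buckets for predicted heavy hitters

    def _bucket(self, item, row):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        if self.hh_oracle(item):
            self.exact[item] += count
        else:
            for r in range(self.depth):
                self.table[r][self._bucket(item, r)] += count

    def estimate(self, item):
        if self.hh_oracle(item):
            return self.exact[item]
        return min(self.table[r][self._bucket(item, r)] for r in range(self.depth))


# Toy usage with a stand-in oracle; the paper's oracle is a trained neural network.
sketch = LearnedCountMin(hh_oracle=lambda flow: flow.startswith("10.0."), width=256, depth=3)
for flow in ["10.0.0.1"] * 50 + ["172.16.0.9"] * 3:
    sketch.update(flow)
print(sketch.estimate("10.0.0.1"), sketch.estimate("172.16.0.9"))
```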
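The splits quoted in the Dataset Splits row are purely temporal: the earliest minutes (or days) train the oracle, the next slice is used for validation, and counts are estimated on the remainder. A hypothetical helper illustrating that kind of split, assuming the stream has already been bucketed by minute or day:

```python
def temporal_split(buckets, train_slices=7, val_slices=1):
    """Hypothetical helper: 'buckets' is a list of per-minute (or per-day) item lists.
    The earliest slices train the oracle, the next slice is validation, and the
    remainder is held out for estimating counts."""
    train = buckets[:train_slices]
    val = buckets[train_slices:train_slices + val_slices]
    test = buckets[train_slices + val_slices:]
    return train, val, test

# Internet traffic: 7 training minutes, 1 validation minute, remaining minutes evaluated.
# AOL queries:      5 training days,    1 validation day,   remaining days evaluated.
# train, val, test = temporal_split(minute_buckets, train_slices=7, val_slices=1)
```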
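For the Experiment Setup row, the internet-traffic model description can be sketched roughly as below. PyTorch is an arbitrary choice here (the paper names no framework, per the Software Dependencies row), and the byte-level IP encoding, GRU cells, and raw-port features are assumptions; only the layer sizes (64-unit RNNs, 16- and 8-unit port networks, 32-unit final layer, scalar log-count output) come from the quoted text.

```python
import torch
import torch.nn as nn

class FlowCountModel(nn.Module):
    """Rough sketch of the packet-count model quoted above: two RNNs encode the
    source and destination IPs, two small fully-connected nets encode the ports,
    and a 32-unit layer precedes the scalar log-count prediction. Byte-level IP
    input, GRU cells, and raw-port features are assumptions, not from the paper."""

    def __init__(self, rnn_hidden=64, embed_dim=32):
        super().__init__()
        self.byte_embed = nn.Embedding(256, embed_dim)          # assumed: IPs fed byte by byte
        self.src_rnn = nn.GRU(embed_dim, rnn_hidden, batch_first=True)
        self.dst_rnn = nn.GRU(embed_dim, rnn_hidden, batch_first=True)
        self.src_port = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 8), nn.ReLU())
        self.dst_port = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 8), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * rnn_hidden + 2 * 8, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, src_ip, dst_ip, src_port, dst_port):
        # src_ip, dst_ip: LongTensor (batch, 4); src_port, dst_port: FloatTensor (batch, 1)
        _, h_src = self.src_rnn(self.byte_embed(src_ip))
        _, h_dst = self.dst_rnn(self.byte_embed(dst_ip))
        feats = torch.cat([h_src[-1], h_dst[-1],
                           self.src_port(src_port), self.dst_port(dst_port)], dim=-1)
        return self.head(feats).squeeze(-1)                     # predicted log(packet count)
```

The AOL heavy-hitter oracle described in the same row follows the same pattern with a character-level LSTM: a 64-dimensional character embedding, 256 hidden units, and a 32-unit fully-connected layer before the count prediction.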