Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements
Authors: Ankush Mandal, He Jiang, Anshumali Shrivastava, Vivek Sarkar
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the new proposal is on average 2.5x faster in practice than FA for distributed and multi-threaded execution. |
| Researcher Affiliation | Academia | Ankush Mandal School of Computer Science Georgia Institute of Technology Atlanta, GA ankush@gatech.edu; He Jiang Department of Computer Science Rice University Houston, TX cary.jiang@rice.edu; Anshumali Shrivastava Department of Computer Science Rice University Houston, TX anshumali@rice.edu; Vivek Sarkar School of Computer Science Georgia Institute of Technology Atlanta, GA vsarkar@gatech.edu |
| Pseudocode | Yes | Algorithm 1: Topkapi; Algorithm 2: Topkapi_Parallel(S[][], K, N, T) |
| Open Source Code | Yes | https://github.com/ankushmandal/topkapi.git |
| Open Datasets | Yes | As for data we have used text data from the Project Gutenberg [1] corpus and PUMA Datasets [9]. |
| Dataset Splits | No | The paper uses "16GB data" and "128GB data" from public sources and discusses "strong scaling" and "weak scaling," but it does not specify train/validation/test splits for these datasets. |
| Hardware Specification | Yes | Used a cluster of Intel® Westmere processors with each node having 12 cores. Used a single node with 32 cores across four IBM POWER7 chips. Used a cluster of IBM POWER7 processors where each node has 32 cores across four processor chips. |
| Software Dependencies | No | The paper states implementations are in C++ and uses MPI for communication, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all the experiments, K is set to 100 unless otherwise stated. We used 512 and 2048 buckets or counters respectively for K=50 and K=200. For example, with l = 4 and b = 1024, the size of the count array is 16KB and the size of the id array is 64KB. Number of threads per node is 8. Used a cluster of Intel® Westmere processors with each node having 12 cores. |
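The pseudocode row (Algorithm 1: Topkapi) and the l/b parameters in the experiment setup describe a sketch laid out as l rows of b buckets, where each bucket holds a counter plus a candidate identifier updated Frequent-style. The following is a minimal illustrative sketch of that structure, not the authors' implementation; the class and method names, the hash choice, and the `candidates` reporting step are all assumptions made for this example.

```python
import hashlib


class TopkapiLikeSketch:
    """Illustrative sketch (not the reference code): a Count-Min-Sketch
    layout of l rows x b buckets, where each bucket additionally runs
    the Frequent algorithm with a single candidate id, so every bucket
    tracks its local heavy hitter."""

    def __init__(self, l=4, b=1024):
        # With l = 4 and b = 1024 there are 4096 buckets: at 4 bytes per
        # counter that is a 16KB count array, and at 16 bytes per id a
        # 64KB id array, matching the sizes quoted in the setup row.
        self.l, self.b = l, b
        self.counts = [[0] * b for _ in range(l)]
        self.ids = [[None] * b for _ in range(l)]

    def _bucket(self, row, item):
        # Per-row hash; md5 is used here only for determinism in the demo.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.b

    def update(self, item):
        for r in range(self.l):
            j = self._bucket(r, item)
            if self.ids[r][j] == item:
                self.counts[r][j] += 1        # same candidate: count up
            elif self.counts[r][j] == 0:
                self.ids[r][j] = item         # empty slot: adopt item
                self.counts[r][j] = 1
            else:
                self.counts[r][j] -= 1        # different candidate: decay

    def candidates(self):
        # Report the union of per-bucket local heavy hitters with an
        # estimated count (max over rows, as in Count-Min queries).
        out = {}
        for r in range(self.l):
            for j in range(self.b):
                item = self.ids[r][j]
                if item is not None:
                    out[item] = max(out.get(item, 0), self.counts[r][j])
        return out
```

Because each bucket's state is just a (count, id) pair and the Frequent update is mergeable, per-thread sketches of this shape can be combined bucket-by-bucket, which is the property the parallel variant (Algorithm 2) exploits.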