Revisiting Consistent Hashing with Bounded Loads
Authors: John Chen, Benjamin Coleman, Anshumali Shrivastava
AAAI 2021, pp. 3976-3983
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the AOL search dataset and the Indiana University Clicks dataset with real user activity, the proposed solution reduces cache misses by several orders of magnitude. Experimental Evaluations: the paper provides both simulation results and results on real server logs, emphasizing that the cascaded overflow effect is the major differentiator between RJ-CH and CH-BL and the largest source of the difference in performance. Simulation Results: n objects and k bins are generated, where each bin has capacity C = (n/k)(1 + ϵ); each bin is hashed into a large array, with bin collisions resolved by rehashing, and bins are populated according to the two methods, RJ-CH and CH-BL. ϵ is swept finely between 0.1 and 3, with 1000 trials from scratch for each ϵ, and results are reported on the percentage of bins full and wall-clock time with 10,000 objects and 1,000 bins (a sketch of this simulation appears after the table). Additional results on the variance of bin loads, bin searches, and objects until the first full bin, a lower-load setting with similar results, and exact implementation details are given in the Appendix. AOL Search Logs Experiments: results are presented on real AOL search logs, a dataset of user activity with 3,826,181 URLs clicked, of which 607,782 are unique. A wide range of configurations used in practice was selected (see Table 2), for example reflecting that 80% of internet usage is video (Cisco 2020). Servers can come and go as in practice: they fail when overloaded and come back online after a certain period of time. |
| Researcher Affiliation | Academia | John Chen, Benjamin Coleman, Anshumali Shrivastava Department of Computer Science, Rice University, Houston, USA {johnchen, Ben.Coleman, anshumali} @rice.edu |
| Pseudocode | No | The paper contains no pseudocode or explicitly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement that the source code for the methodology is available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | In our experiments on real user logs from the AOL search dataset and Indiana University Clicks dataset (Meiss et al. 2008), the algorithm reduces cache misses by several orders of magnitude. AOL Search Logs Experiments: results are presented with real AOL search logs, a dataset of user activity with 3,826,181 URLs clicked, of which 607,782 are unique. Indiana University Clicks Search Logs Experiments: results use the Indiana University Clicks search logs, a dataset of user activity from which the first 1,000,000 URLs clicked are used, of which 26,062 are unique. |
| Dataset Splits | No | The paper mentions using the AOL search dataset and the Indiana University Clicks dataset for its experiments but does not explicitly describe training, validation, or test splits. The Simulation Results section generates n objects and k bins and sweeps ϵ finely between 0.1 and 3, performing 1000 trials from scratch for each ϵ, but this describes simulation trials, not training/validation/test splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions 'distributed caching' and 'servers' but no specific CPU/GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers for its implementation or experiments. It mentions systems and organizations that use consistent hashing, such as ASP.NET, Microsoft, Mozilla, Discord, Amazon Dynamo, Apache Cassandra, Google Cloud, and Vimeo, but these are deployment contexts for CH, not software dependencies of the paper's own methodology. |
| Experiment Setup | No | The paper describes simulation setups with parameters such as n objects and k bins where each bin has capacity C = (n/k)(1 + ϵ), sweeping ϵ finely between 0.1 and 3 and performing 1000 trials per value. For the real datasets, Table 2 gives the distributed-caching configuration for the AOL search dataset: number of servers, cache size, minutes until stale URLs are evicted, minutes requests are served, minutes for a failed server to recover, and number of concurrent requests until server failure (a replay sketch using these knobs follows the simulation sketch below). These are high-level system configurations rather than the specific hyperparameters or model-training settings, such as learning rate, batch size, or optimizer, typically found in machine learning papers. |
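
To make the simulation protocol concrete, below is a minimal Python sketch of the CH-BL side of the experiment: the k bins are hashed onto a ring with collisions resolved by rehashing, each bin receives capacity C = ⌈(n/k)(1 + ϵ)⌉, and each object walks clockwise from its hash position to the first bin with spare capacity. The hash function (blake2b), the ring size m, and the ceiling on C are assumptions for illustration; the paper's exact implementation details are in its Appendix.

```python
import bisect
import hashlib
import math

def ring_hash(key: str, seed: int, m: int) -> int:
    """Hash `key` into [0, m). blake2b is an assumption; the paper does not
    specify which hash function was used."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % m

def simulate_chbl(n: int = 10_000, k: int = 1_000, eps: float = 0.1, m: int = 2**20):
    """One CH-BL trial; returns (fraction of full bins, avg bins searched per object)."""
    capacity = math.ceil(n / k * (1 + eps))  # C = ceil((n/k)(1 + eps))
    # Hash the k bins into a large array, resolving bin collisions by rehashing.
    ring = {}
    for b in range(k):
        seed = 0
        pos = ring_hash(f"bin-{b}", seed, m)
        while pos in ring:
            seed += 1
            pos = ring_hash(f"bin-{b}", seed, m)
        ring[pos] = b
    positions = sorted(ring)  # bin positions in clockwise order
    loads = [0] * k
    searches = 0
    for obj in range(n):
        # First bin at or after the object's hash position on the ring.
        i = bisect.bisect_left(positions, ring_hash(f"obj-{obj}", 0, m)) % k
        searches += 1
        # Cascaded overflow: a full bin forwards the object to the next bin.
        while loads[ring[positions[i]]] >= capacity:
            i = (i + 1) % k
            searches += 1
        loads[ring[positions[i]]] += 1
    frac_full = sum(load >= capacity for load in loads) / k
    return frac_full, searches / n

if __name__ == "__main__":
    for eps in (0.1, 0.5, 1.0, 3.0):
        print(eps, simulate_chbl(eps=eps))
```

Sweeping ϵ over [0.1, 3] with 1000 independent trials per value, as described above, simply wraps `simulate_chbl` in two loops. An RJ-CH variant would differ in re-placing overflowed objects by random jumps rather than clockwise forwarding, which is the cascaded-overflow behavior the excerpt singles out as the main source of the performance gap.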
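
The Table 2 knobs named in the Experiment Setup row map onto a small replay harness. The sketch below is hypothetical: the `CacheServer` class, the LRU policy, and all parameters are placeholders rather than the authors' configuration; only the parameter list (cache size, stale-eviction window, recovery time, and concurrent-request failure threshold) comes from the paper.

```python
from collections import OrderedDict

class CacheServer:
    """One server in the log-replay experiments (hypothetical sketch)."""

    def __init__(self, cache_size: int, stale_minutes: int,
                 recover_minutes: int, fail_threshold: int):
        self.cache = OrderedDict()              # url -> minute of last access
        self.cache_size = cache_size            # "Cache size"
        self.stale_minutes = stale_minutes      # "Minutes for stale urls to be evicted"
        self.recover_minutes = recover_minutes  # "Minutes for failed server to recover"
        self.fail_threshold = fail_threshold    # "Number of concurrent requests till server failure"
        self.down_until = -1                    # minute at which the server is back online

    def is_up(self, minute: int) -> bool:
        return minute >= self.down_until

    def request(self, url: str, minute: int, concurrent: int) -> bool:
        """Serve one request; return True on a cache hit."""
        if concurrent > self.fail_threshold:
            # Overloaded servers fail and come back online after a fixed period.
            self.down_until = minute + self.recover_minutes
            return False
        # Evict URLs that have gone stale.
        for u in [u for u, t in self.cache.items() if minute - t > self.stale_minutes]:
            del self.cache[u]
        hit = url in self.cache
        self.cache[url] = minute                # record / refresh last access
        self.cache.move_to_end(url)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)      # evict least recently used (assumed policy)
        return hit
```

A replay loop would route each logged URL to a server via RJ-CH or CH-BL, count cache misses, and skip servers whose `is_up()` returns False, reproducing the "servers can come and go" behavior described in the Research Type row.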