Revisiting Consistent Hashing with Bounded Loads
Authors: John Chen, Benjamin Coleman, Anshumali Shrivastava
AAAI 2021, pp. 3976-3983
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the AOL search dataset and the Indiana University Clicks dataset with real user activity, the proposed solution reduces cache misses by several orders of magnitude. Experimental Evaluations: the paper provides both simulation results and results on real server logs, emphasizing that the cascaded overflow effect is the major differentiator between RJ-CH and CH-BL and the largest source of the difference in performance. Simulation Results: n objects and k bins are generated, where each bin has capacity C = (n/k)(1 + ϵ); each bin is hashed into a large array, with bin collisions resolved by rehashing, and bins are populated according to the two methods, RJ-CH and CH-BL. ϵ is swept finely between 0.1 and 3, with 1000 trials from scratch for each ϵ, and results are reported on the percentage of bins full and wall-clock time with 10,000 objects and 1,000 bins (a sketch of this simulation appears after the table). Additional results on the variance of bin loads, bin searches, and objects until the first full bin, a lower-load setting with similar results, and exact implementation details are given in the Appendix. AOL Search Logs Experiments: results are presented on real AOL search logs, a dataset of user activity with 3,826,181 URLs clicked, of which 607,782 are unique. A wide range of configurations used in practice was selected (see Table 2), for example reflecting that 80% of internet usage is video (Cisco 2020). Servers can come and go as in practice: they fail when overloaded and come back online after a certain period of time. |
| Researcher Affiliation | Academia | John Chen, Benjamin Coleman, Anshumali Shrivastava Department of Computer Science, Rice University, Houston, USA {johnchen, Ben.Coleman, anshumali} @rice.edu |
| Pseudocode | No | The paper contains no pseudocode or explicitly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement that the source code for the methodology is available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | In our experiments on real user logs from the AOL search dataset and Indiana University Clicks dataset (Meiss et al. 2008), the algorithm reduces cache misses by several orders of magnitude. AOL Search Logs Experiments: results are presented with real AOL search logs, a dataset of user activity with 3,826,181 URLs clicked, of which 607,782 are unique. Indiana University Clicks Search Logs Experiments: results use the Indiana University Clicks search logs, a dataset of user activity from which the first 1,000,000 URLs clicked are used, of which 26,062 are unique. |
| Dataset Splits | No | The paper mentions using the AOL search dataset and the Indiana University Clicks dataset for its experiments but does not explicitly describe training, validation, or test splits. The Simulation Results section generates n objects and k bins and sweeps ϵ finely between 0.1 and 3, performing 1000 trials from scratch for each ϵ, but this describes simulation trials, not training/validation/test splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions 'distributed caching' and 'servers' but no specific CPU/GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers for its implementation or experiments. It mentions systems and organizations that use consistent hashing, such as ASP.NET, Microsoft, Mozilla, Discord, Amazon Dynamo, Apache Cassandra, Google Cloud, and Vimeo, but these are deployment contexts for CH, not software dependencies of the paper's own methodology. |
| Experiment Setup | No | The paper describes simulation setups with parameters such as n objects and k bins where each bin has capacity C = (n/k)(1 + ϵ), sweeping ϵ finely between 0.1 and 3 and performing 1000 trials per value. For the real datasets, Table 2 gives the distributed-caching configuration for the AOL search dataset: number of servers, cache size, minutes until stale URLs are evicted, minutes requests are served, minutes for a failed server to recover, and number of concurrent requests until server failure (a replay sketch using these knobs follows the simulation sketch below). These are high-level system configurations rather than the specific hyperparameters or model-training settings, such as learning rate, batch size, or optimizer, typically found in machine learning papers. |
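
To make the simulation protocol concrete, below is a minimal Python sketch of the CH-BL side of the experiment: the k bins are hashed onto a ring with collisions resolved by rehashing, each bin receives capacity C = ⌈(n/k)(1 + ϵ)⌉, and each object walks clockwise from its hash position to the first bin with spare capacity. The hash function (blake2b), the ring size m, and the ceiling on C are assumptions for illustration; the paper's exact implementation details are in its Appendix.

```python
import bisect
import hashlib
import math

def ring_hash(key: str, seed: int, m: int) -> int:
    """Hash `key` into [0, m). blake2b is an assumption; the paper does not
    specify which hash function was used."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % m

def simulate_chbl(n: int = 10_000, k: int = 1_000, eps: float = 0.1, m: int = 2**20):
    """One CH-BL trial; returns (fraction of full bins, avg bins searched per object)."""
    capacity = math.ceil(n / k * (1 + eps))  # C = ceil((n/k)(1 + eps))
    # Hash the k bins into a large array, resolving bin collisions by rehashing.
    ring = {}
    for b in range(k):
        seed = 0
        pos = ring_hash(f"bin-{b}", seed, m)
        while pos in ring:
            seed += 1
            pos = ring_hash(f"bin-{b}", seed, m)
        ring[pos] = b
    positions = sorted(ring)  # bin positions in clockwise order
    loads = [0] * k
    searches = 0
    for obj in range(n):
        # First bin at or after the object's hash position on the ring.
        i = bisect.bisect_left(positions, ring_hash(f"obj-{obj}", 0, m)) % k
        searches += 1
        # Cascaded overflow: a full bin forwards the object to the next bin.
        while loads[ring[positions[i]]] >= capacity:
            i = (i + 1) % k
            searches += 1
        loads[ring[positions[i]]] += 1
    frac_full = sum(load >= capacity for load in loads) / k
    return frac_full, searches / n

if __name__ == "__main__":
    for eps in (0.1, 0.5, 1.0, 3.0):
        print(eps, simulate_chbl(eps=eps))
```

Sweeping ϵ over [0.1, 3] with 1000 independent trials per value, as described above, simply wraps `simulate_chbl` in two loops. An RJ-CH variant would differ in re-placing overflowed objects by random jumps rather than clockwise forwarding, which is the cascaded-overflow behavior the excerpt singles out as the main source of the performance gap.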
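
The Table 2 knobs named in the Experiment Setup row map onto a small replay harness. The sketch below is hypothetical: the `CacheServer` class, the LRU policy, and all parameters are placeholders rather than the authors' configuration; only the parameter list (cache size, stale-eviction window, recovery time, and concurrent-request failure threshold) comes from the paper.

```python
from collections import OrderedDict

class CacheServer:
    """One server in the log-replay experiments (hypothetical sketch)."""

    def __init__(self, cache_size: int, stale_minutes: int,
                 recover_minutes: int, fail_threshold: int):
        self.cache = OrderedDict()              # url -> minute of last access
        self.cache_size = cache_size            # "Cache size"
        self.stale_minutes = stale_minutes      # "Minutes for stale urls to be evicted"
        self.recover_minutes = recover_minutes  # "Minutes for failed server to recover"
        self.fail_threshold = fail_threshold    # "Number of concurrent requests till server failure"
        self.down_until = -1                    # minute at which the server is back online

    def is_up(self, minute: int) -> bool:
        return minute >= self.down_until

    def request(self, url: str, minute: int, concurrent: int) -> bool:
        """Serve one request; return True on a cache hit."""
        if concurrent > self.fail_threshold:
            # Overloaded servers fail and come back online after a fixed period.
            self.down_until = minute + self.recover_minutes
            return False
        # Evict URLs that have gone stale.
        for u in [u for u, t in self.cache.items() if minute - t > self.stale_minutes]:
            del self.cache[u]
        hit = url in self.cache
        self.cache[url] = minute                # record / refresh last access
        self.cache.move_to_end(url)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)      # evict least recently used (assumed policy)
        return hit
```

A replay loop would route each logged URL to a server via RJ-CH or CH-BL, count cache misses, and skip servers whose `is_up()` returns False, reproducing the "servers can come and go" behavior described in the Research Type row.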