Partitioned Learned Bloom Filters

Authors: Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska

ICLR 2021

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Experimental results from both simulated and real-world datasets show significant performance improvements from our optimization approach over both the original learned Bloom filter constructions and previously proposed heuristic improvements."

Researcher Affiliation | Academia | Kapil Vaidya, CSAIL, MIT (kapilv@mit.edu); Eric Knorr, School of Engineering and Applied Sciences, Harvard University (eric.r.knorr@gmail.com); Tim Kraska, CSAIL, MIT (kraska@mit.edu); Michael Mitzenmacher, School of Engineering and Applied Sciences, Harvard University (michaelm@eecs.harvard.edu)

Pseudocode | Yes | Algorithm 1: Finding optimal fprs given thresholds; Algorithm 2: Using the relaxed solution for the general problem; Algorithm 3: Solving the general problem.

Open Source Code | No | The paper mentions using a third-party "bloom-filter" Python package but does not provide or state the release of its own source code for the described methodology.

Open Datasets | Yes | URLs: "As in previous papers [Kraska et al. (2018); Dai & Shrivastava (2019)], we used the URL data set..." EMBER: "Ember (Endgame Malware Benchmark for Research) [Anderson & Roth (2018)] is an open source collection of 1.1M sha256 file hashes..."

Dataset Splits | No | The paper states: "We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F." While tuning implies a validation process, the paper does not explicitly define a validation set with specific proportions or a clear train/validation/test split.

Hardware Specification | Yes | "All the runtime experiments in this subsection and the next subsection are measured using a 2.8GHz quad-core Intel Core i7 CPU with 16GB of memory."

Software Dependencies | No | The paper mentions using the random forest classifier from sklearn and the "bloom-filter" Python package, and states that the algorithms are implemented in Python, but it does not specify version numbers for these software components.

Experiment Setup | Yes | "We use PLBF Alg.3 with DP algorithm discretization (N) set to 1000. We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F. ... We use five regions (k = 5) for both PLBF and Ada-BF."
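The table notes pseudocode for three algorithms but no released source. To make the reviewed construction concrete, here is a minimal Python sketch of the partitioned idea the paper describes: the model-score range is split into regions by thresholds, and each region's keys go into a backup Bloom filter with that region's false-positive rate. The `BloomFilter` class, the `model` callable, and all parameter names are illustrative stand-ins, not the authors' implementation (the paper used a third-party "bloom-filter" package):

```python
import hashlib
import math
from bisect import bisect_right


class BloomFilter:
    """Minimal standard Bloom filter (illustrative stand-in for the
    third-party 'bloom-filter' package mentioned in the paper)."""

    def __init__(self, n_items, fpr):
        # Standard sizing: m = -n*ln(f)/(ln 2)^2 bits, k = (m/n)*ln(2) hashes.
        self.m = max(1, int(-n_items * math.log(fpr) / (math.log(2) ** 2)))
        self.k = max(1, round((self.m / n_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class PLBF:
    """Partitioned learned Bloom filter sketch: the score range is split
    into regions by thresholds; keys in each region go into a backup
    Bloom filter tuned to that region's false-positive rate."""

    def __init__(self, model, keys, thresholds, fprs):
        self.model = model            # maps an item to a score
        self.thresholds = thresholds  # sorted interior thresholds (k-1 of them)
        buckets = [[] for _ in fprs]
        for key in keys:
            buckets[self._region(key)].append(key)
        self.filters = []
        for bucket, f in zip(buckets, fprs):
            if f >= 1.0:              # f_i = 1: always accept, no backup filter
                self.filters.append(None)
            else:
                bf = BloomFilter(max(1, len(bucket)), f)
                for key in bucket:
                    bf.add(key)
                self.filters.append(bf)

    def _region(self, item):
        return bisect_right(self.thresholds, self.model(item))

    def query(self, item):
        bf = self.filters[self._region(item)]
        return True if bf is None else item in bf
```

As in any Bloom filter, a query may return a false positive but never a false negative: every inserted key hashes to the same bit positions on lookup.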
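The "optimal fprs given thresholds" step referenced in the pseudocode row has a well-known relaxed form: minimizing total backup-filter space subject to an overall false-positive target F gives per-region rates proportional to h_i/g_i, where h_i and g_i are the key and non-key probability mass in score region i, with any rate that would exceed 1 clipped to 1 and the remaining regions re-solved. The sketch below is my reading of that relaxation, not the authors' code; `h`, `g`, and `F` are assumed inputs:

```python
def relaxed_fprs(h, g, F):
    """Per-region false-positive rates for the relaxed problem:
    minimize sum_i h[i]*log(1/f[i]) subject to sum_i g[i]*f[i] = F,
    where h[i] / g[i] are the key / non-key mass in score region i.
    The Lagrangian optimum is f[i] proportional to h[i]/g[i]; regions
    whose rate would exceed 1 get f[i] = 1 (no backup filter needed)
    and the rest are re-solved with the remaining budget."""
    k = len(h)
    f = [1.0] * k
    active = set(range(k))
    scale = 0.0
    while active:
        # Budget left after regions already fixed at f[i] = 1.
        budget = F - sum(g[i] for i in range(k) if i not in active)
        scale = budget / sum(h[i] for i in active)
        capped = [i for i in active if scale * h[i] / g[i] >= 1.0]
        if not capped:
            break
        active -= set(capped)  # these regions keep f[i] = 1
    for i in active:
        f[i] = scale * h[i] / g[i]
    return f
```

By construction the returned rates satisfy sum_i g[i]*f[i] = F exactly: the clipped regions contribute their full g[i], and the rescaled remainder absorbs the rest of the budget.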