Partitioned Learned Bloom Filters
Authors: Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results from both simulated and real-world datasets show significant performance improvements from our optimization approach over both the original learned Bloom filter constructions and previously proposed heuristic improvements. |
| Researcher Affiliation | Academia | Kapil Vaidya, CSAIL, MIT (kapilv@mit.edu); Eric Knorr, School of Engineering and Applied Sciences, Harvard University (eric.r.knorr@gmail.com); Tim Kraska, CSAIL, MIT (kraska@mit.edu); Michael Mitzenmacher, School of Engineering and Applied Sciences, Harvard University (michaelm@eecs.harvard.edu) |
| Pseudocode | Yes | Algorithm 1: Finding optimal FPRs given thresholds; Algorithm 2: Using relaxed solution for the general problem; Algorithm 3: Solving the general problem |
| Open Source Code | No | The paper mentions using a third-party 'bloom-filter python package' but does not provide or state the release of its own source code for the described methodology. |
| Open Datasets | Yes | URLs: As in previous papers [Kraska et al. (2018), Dai & Shrivastava (2019)], we used the URL data set... EMBER: ...Ember (Endgame Malware Benchmark for Research) [Anderson & Roth (2018)] is an open source collection of 1.1M sha256 file hashes... |
| Dataset Splits | No | The paper states: 'We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F.' While tuning implies a validation process, it does not explicitly define a 'validation set' with specific proportions or a clear train/validation/test split. |
| Hardware Specification | Yes | All the runtime experiments in this subsection and the next subsection are measured using a quad-core Intel Core i7 CPU @ 2.80GHz with 16GB of memory. |
| Software Dependencies | No | The paper mentions using 'random forest classifier from sklearn' and the 'bloom-filter python package' and that algorithms are 'implemented in Python', but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | We use PLBF (Alg. 3) with the DP algorithm discretization (N) set to 1000. We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F. ...We use five regions (k = 5) for both PLBF and Ada-BF... |
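The setup described above (a learned score partitioned into k regions, each backed by its own Bloom filter) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `score` function stands in for the learned model, the Bloom filter uses the standard sizing formulas rather than the paper's tuning, and the thresholds and per-region FPRs are illustrative placeholders, not the optimal values produced by Algorithm 1's dynamic program.

```python
import math
import hashlib

class BloomFilter:
    """Simple Bloom filter sized for a target false-positive rate."""
    def __init__(self, n_items, fpr):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.m = max(1, math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2)))
        self.k = max(1, round((self.m / n_items) * math.log(2)))
        self.bits = [False] * self.m

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

class PLBF:
    """Partitioned learned Bloom filter sketch: the model's score range is
    split into regions by thresholds; each region gets its own backup
    Bloom filter tuned to a region-specific false-positive rate."""
    def __init__(self, keys, score, thresholds, fprs):
        self.score = score
        self.thresholds = thresholds  # ascending region boundaries
        # Partition keys by score region, then build one filter per region.
        buckets = [[] for _ in range(len(thresholds) + 1)]
        for key in keys:
            buckets[self._region(score(key))].append(key)
        self.filters = [BloomFilter(len(b), f) if b else None
                        for b, f in zip(buckets, fprs)]
        for bucket, filt in zip(buckets, self.filters):
            if filt is not None:
                for key in bucket:
                    filt.add(key)

    def _region(self, s):
        # Index of the region whose score interval contains s.
        return sum(s >= t for t in self.thresholds)

    def __contains__(self, item):
        filt = self.filters[self._region(self.score(item))]
        return filt is not None and item in filt
```

Because every key is inserted into the backup filter of its own score region, and a lookup routes through the same region, the structure preserves the Bloom filter's no-false-negatives guarantee; only the false-positive rate varies by region.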