Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fast Partitioned Learned Bloom Filter
Authors: Atsuki Sato, Yusuke Matsui
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results from real-world datasets show that (i) fast PLBF and fast PLBF++ can construct the data structure up to 233 and 761 times faster than PLBF, (ii) fast PLBF can achieve the same memory efficiency as PLBF, and (iii) fast PLBF++ can achieve almost the same memory efficiency as PLBF. |
| Researcher Affiliation | Academia | Atsuki Sato Yusuke Matsui The University of Tokyo Tokyo, Japan EMAIL EMAIL |
| Pseudocode | Yes | The pseudo-code for PLBF construction is provided in the appendix. (Referring to Algorithm 1, 2, 3, 4, 5 in the appendix) |
| Open Source Code | Yes | The codes are available at https://github.com/atsukisato/FastPLBF. |
| Open Datasets | Yes | Malicious URLs Dataset: As in previous papers [11, 14], we used Malicious URLs Dataset [17]. ... [17] Manu Siddhartha. Malicious urls dataset \| kaggle. URL https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset [Online; accessed 22-December-2022], 2021. EMBER Dataset: We used the EMBER dataset [18] as in the PLBF research. ... [18] Hyrum S Anderson and Phil Roth. Ember: an open dataset for training static pe malware machine learning models. arXiv preprint arXiv:1804.04637, 2018. |
| Dataset Splits | No | We used all malicious URLs and 342,482 (80%) benign URLs as the training set, and the remaining benign URLs as the test set. ... We used all malicious files and 300,000 (75%) benign files as the train set and the remaining benign files as the test set. The paper specifies training and test sets but does not explicitly mention a separate validation set split. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (CPU, GPU models, memory, etc.) used for running the experiments. |
| Software Dependencies | No | While any model can be used for the classifier, we used LightGBM [19] because of its speed in training and inference, as well as its memory efficiency and accuracy. The paper mentions LightGBM but does not provide a specific version number for it or any other software dependency. |
| Experiment Setup | Yes | Following the experiments in the PLBF paper, hyperparameters for PLBF, fast PLBF, and fast PLBF++ were set to N = 1,000 and k = 5. ... The memory size is specified by the user, and N and k are hyperparameters that are determined by balancing construction time and accuracy. |
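
The split quoted in the Dataset Splits row (all positive keys plus a fixed fraction of negatives for training, the remaining negatives for testing) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `split_for_learned_filter`, the seeded shuffle, and the toy key lists are assumptions, and the paper does not state how the benign samples were selected.

```python
import random


def split_for_learned_filter(positives, negatives, train_fraction, seed=0):
    """Split keys as in the quoted setup: every positive key goes into the
    training set, a fraction of the negatives joins it, and the remaining
    negatives form the test set.

    Shuffling with a fixed seed is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    train_pos = list(positives)
    train_neg = shuffled[:n_train]
    test_neg = shuffled[n_train:]
    return train_pos, train_neg, test_neg


# Toy stand-ins for malicious (positive) and benign (negative) keys.
malicious = [f"bad-{i}" for i in range(10)]
benign = [f"good-{i}" for i in range(100)]

# 0.8 mirrors the 80% benign training share quoted for the URL dataset.
train_pos, train_neg, test_neg = split_for_learned_filter(malicious, benign, 0.8)
print(len(train_pos), len(train_neg), len(test_neg))
```

The test set holds only negatives because a Bloom filter never rejects an inserted key, so the quantity measured on held-out data is the false positive rate.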