Privacy-Preserving Gradient Boosting Decision Trees

Authors: Qinbin Li, Zhaomin Wu, Zeyi Wen, Bingsheng He

AAAI 2020, pp. 784-791

Each reproducibility variable is listed below with its result and the corresponding LLM response.

Research Type: Experimental
"In this section, we evaluate the effectiveness and efficiency of DPBoost. We compare DPBoost with three other approaches: 1) NP (the vanilla GBDT): train GBDTs without privacy concerns. 2) PARA: a recent approach (Zhao et al. 2018) that adopts parallel composition to train multiple trees and uses only half of the unused instances when training a differentially private tree. 3) SEQ: we extend the previous approach on decision trees (Liu et al. 2018), which aggregates differentially private decision trees using sequential composition. [...] We implemented DPBoost based on LightGBM. Our experiments are conducted on a machine with one Xeon W-2155 10-core CPU. We use 10 public datasets in our evaluation."
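
The SEQ and PARA baselines differ in how they account the privacy budget across trees. The following is a minimal sketch of that distinction; the function names and the uniform per-tree budgets are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch of differential-privacy composition accounting.
# Assumption: each of the T trees is trained with its own per-tree budget.

def sequential_composition(per_tree_budgets):
    """Sequential composition (SEQ-style): trees may query overlapping data,
    so the consumed budget is the SUM of the per-tree budgets."""
    return sum(per_tree_budgets)

def parallel_composition(per_tree_budgets):
    """Parallel composition (PARA-style): trees query disjoint subsets of
    the data, so the consumed budget is the MAX of the per-tree budgets."""
    return max(per_tree_budgets)

budgets = [5.0] * 20  # 20 trees with budget 5 each (numbers are illustrative)
print(sequential_composition(budgets))  # 100.0 -> budgets add up
print(parallel_composition(budgets))    # 5.0   -> disjoint data, budget reused
```
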
Researcher Affiliation: Academia
"Qinbin Li (1), Zhaomin Wu (1), Zeyi Wen (2), Bingsheng He (1). 1: National University of Singapore; 2: The University of Western Australia. {qinbin, zhaomin, hebs}@comp.nus.edu.sg, zeyi.wen@uwa.edu.au"
Pseudocode: Yes
"Algorithm 1: Train Single Tree: train a differentially private decision tree"
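
The excerpt names the algorithm but does not reproduce it. For orientation, below is a minimal sketch of the standard recipe that differentially private tree training follows (exponential mechanism to pick splits, Laplace noise on leaf values); the function names, gain scores, and budget handling are assumptions, not the authors' exact Algorithm 1.

```python
import numpy as np

def choose_split(gains, eps, sensitivity):
    """Exponential mechanism: pick candidate split i with probability
    proportional to exp(eps * gain_i / (2 * sensitivity))."""
    scores = eps * np.asarray(gains, dtype=float) / (2.0 * sensitivity)
    scores -= scores.max()  # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return np.random.choice(len(gains), p=probs)

def noisy_leaf_value(grad_sum, hess_sum, lam, eps, sensitivity):
    """GBDT leaf value -G / (H + lambda), released with Laplace noise
    calibrated to the leaf's sensitivity and its share of the budget."""
    value = -grad_sum / (hess_sum + lam)
    return value + np.random.laplace(scale=sensitivity / eps)
```
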
Open Source Code: No
"We have implemented our approach (named DPBoost) based on a popular library called LightGBM (Ke et al. 2017). Our experimental results show that DPBoost is much superior to the other approaches and can achieve competitive performance compared with the ordinary LightGBM. [...] Footnote 1: https://github.com/microsoft/LightGBM"
Open Datasets: Yes
"We use 10 public datasets in our evaluation. The details of the datasets are summarized in Table 1. There are eight real-world datasets and two synthetic datasets (i.e., synthetic cls and synthetic reg). The real-world datasets are available from the LIBSVM website. The synthetic datasets are generated using scikit-learn (Pedregosa et al. 2011). [...] Footnote 2: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/; Footnote 3: https://scikit-learn.org/stable/datasets/index.html#sample-generators"
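
The excerpt says only that the synthetic datasets come from scikit-learn's sample generators; it does not give the generator settings. A minimal sketch with placeholder sizes might look like:

```python
from sklearn.datasets import make_classification, make_regression

# Placeholder sizes: the excerpt does not state the actual generator parameters.
X_cls, y_cls = make_classification(n_samples=100_000, n_features=50, random_state=0)
X_reg, y_reg = make_regression(n_samples=100_000, n_features=50, random_state=0)
```
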
Dataset Splits: Yes
"We use 5-fold cross-validation for model evaluation."
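
The stated protocol maps directly onto scikit-learn's KFold. The data below are stand-ins, and the per-fold training step is left as a comment, since the excerpt does not document DPBoost's training interface.

```python
import numpy as np
from sklearn.model_selection import KFold

X, y = np.random.rand(1000, 20), np.random.rand(1000)  # stand-in data

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # train on (X_train, y_train), evaluate on (X_test, y_test)
```
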
Hardware Specification: Yes
"Our experiments are conducted on a machine with one Xeon W-2155 10-core CPU."
Software Dependencies: No
The paper mentions using 'LightGBM' and 'scikit-learn' but does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility.
Experiment Setup: Yes
"The maximum depth is set to 6. The regularization parameter λ is set to 0.1. The threshold g_l is set to 1. We use 5-fold cross-validation for model evaluation. The number of trees inside an ensemble is set to 50 in DPBoost. [...] The privacy budget for each ensemble is set to 5. For fairness, the total privacy budget for SEQ and PARA is set to 100 to achieve the same privacy level as DPBoost."
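
Since DPBoost is built on LightGBM, the stated hyperparameters map onto standard LightGBM options as sketched below. The privacy-specific settings are commented out because their option names in the DPBoost fork are not documented in the excerpt and are purely hypothetical.

```python
import lightgbm as lgb

# Standard LightGBM options matching the stated setup (this configuration
# corresponds to the NP baseline; these option names exist in LightGBM).
params = {
    "objective": "regression",
    "max_depth": 6,        # "The maximum depth is set to 6."
    "lambda_l2": 0.1,      # regularization parameter lambda = 0.1
    "num_iterations": 50,  # 50 trees per ensemble
}

# Hypothetical DPBoost-fork options (NOT real LightGBM parameters):
# params.update({"privacy_budget_per_ensemble": 5.0, "gradient_threshold": 1.0})

# train_set = lgb.Dataset(X_train, label=y_train)  # from one CV fold
# model = lgb.train(params, train_set)
```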