Trained Random Forests Completely Reveal your Dataset
Authors: Julien Ferry, Ricardo Fukasawa, Timothée Pascal, Thibaut Vidal
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to a complete reconstruction. We rely on three popular datasets for binary classification in our experiments. |
| Researcher Affiliation | Academia | ¹CIRRELT & SCALE-AI Chair in Data-Driven Supply Chains, Department of Mathematics and Industrial Engineering, Polytechnique Montréal, Canada; ²Department of Combinatorics and Optimization, University of Waterloo, Canada; ³École nationale des ponts et chaussées, Paris, France. |
| Pseudocode | No | The paper describes the approach using text and mathematical formulations but does not include structured pseudocode or an algorithm block. |
| Open Source Code | Yes | Our source code is openly accessible at https://github.com/vidalt/DRAFT in the form of a user-friendly Python module named DRAFT (Dataset Reconstruction Attack From Trained ensembles), under a MIT license. |
| Open Datasets | Yes | We rely on three popular datasets for binary classification in our experiments... First, the COMPAS dataset (analyzed by Angwin et al., 2016)... Second, the UCI Adult Income dataset (Dua & Graff, 2017)... Finally, we use the Default of Credit Card Client dataset (Yeh & Lien, 2009)... |
| Dataset Splits | No | The paper mentions a 'training set' and a 'test set' but does not specify a separate validation set or validation split. |
| Hardware Specification | Yes | All experiments are run on a computing cluster over a set of homogeneous nodes using Intel Platinum 8260 Cascade Lake @ 2.4GHz CPU. |
| Software Dependencies | Yes | The proposed CP models described in Section 6 are solved using the OR-Tools CP-SAT solver (Perron & Didier, v9). A toy CP-SAT encoding in the spirit of this formulation is sketched below the table. |
| Experiment Setup | Yes | More precisely, we use a number of trees \|T\| ∈ {1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} with maximum depth dmax ∈ {None, 2, 3, 4, 5, 10} (where None stands for no maximum depth constraint). For each experiment, we randomly sample 100 examples from the entire dataset to form a training set, and use the remaining ones as a test set to verify to what extent the models generalize. We repeat the experiment five times using different seeds for the random sampling, and report the average results and their standard deviation across the five runs. A minimal sketch of this grid is given below the table. |
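
The paper's CP formulation is not reproduced on this page. As a rough illustration of the kind of model CP-SAT solves here, the toy sketch below encodes a single decision-tree leaf's path as constraints on one reconstructed example with binary features. This is not the authors' formulation: the number of features and the path tests (x_0 == 1, x_2 == 0) are illustrative assumptions.

```python
# Toy sketch, NOT the paper's CP model: encode one decision-tree leaf's
# path tests as CP-SAT constraints on a single reconstructed example.
# The path tests below (x_0 == 1, x_2 == 0) are illustrative assumptions.
from ortools.sat.python import cp_model

model = cp_model.CpModel()
n_features = 4
x = [model.NewBoolVar(f"x_{j}") for j in range(n_features)]

# A training example known to reach this leaf must satisfy its path tests.
model.Add(x[0] == 1)
model.Add(x[2] == 0)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(v) for v in x])  # one feasible reconstruction
```

The paper's actual models aggregate such constraints across all trees, leaves, and training examples at once (see its Section 6); this toy only shows the basic encoding step.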
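
The experiment setup maps directly onto standard tooling. Below is a minimal sketch of the reported grid, assuming a scikit-learn implementation (the page does not state which training library the authors used); `bootstrap=False` reflects the no-bagging setting quoted under Research Type, `max_features="sqrt"` stands in for feature randomization, and the `run_grid` helper name is hypothetical.

```python
# Minimal sketch of the paper's experimental grid, assuming scikit-learn.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hyperparameter grid reported in the paper.
N_TREES = [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
MAX_DEPTHS = [None, 2, 3, 4, 5, 10]  # None = no maximum depth constraint

def run_grid(X, y):
    """Train one forest per (|T|, dmax, seed) combination."""
    for seed in range(5):  # five repetitions with different sampling seeds
        # 100 training examples; the remaining ones form the test set.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=100, random_state=seed)
        for n_trees in N_TREES:
            for d_max in MAX_DEPTHS:
                clf = RandomForestClassifier(
                    n_estimators=n_trees,
                    max_depth=d_max,
                    bootstrap=False,      # no bootstrap aggregation
                    max_features="sqrt",  # feature randomization
                    random_state=seed,
                ).fit(X_tr, y_tr)
                yield n_trees, d_max, seed, clf.score(X_te, y_te)
```

A forest trained this way is the object the authors' DRAFT module then attacks; the repository README at https://github.com/vidalt/DRAFT documents that interface.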