Trained Random Forests Completely Reveal your Dataset

Authors: Julien Ferry, Ricardo Fukasawa, Timothée Pascal, Thibaut Vidal

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Each entry below gives a reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. "Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to a complete reconstruction. We rely on three popular datasets for binary classification in our experiments."

Researcher Affiliation: Academia. "(1) CIRRELT & SCALE-AI Chair in Data-Driven Supply Chains, Department of Mathematics and Industrial Engineering, Polytechnique Montréal, Canada; (2) Department of Combinatorics and Optimization, University of Waterloo, Canada; (3) École nationale des ponts et chaussées, Paris, France."

Pseudocode: No. The paper describes the approach using text and mathematical formulations but does not include structured pseudocode or an algorithm block.
Open Source Code: Yes. "Our source code is openly accessible at https://github.com/vidalt/DRAFT in the form of a user-friendly Python module named DRAFT (Dataset Reconstruction Attack From Trained ensembles), under an MIT license."
Open Datasets: Yes. "We rely on three popular datasets for binary classification in our experiments... First, the COMPAS dataset (analyzed by Angwin et al., 2016)... Second, the UCI Adult Income dataset (Dua & Graff, 2017)... Finally, we use the Default of Credit Card Clients dataset (Yeh & Lien, 2009)..."
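All three datasets are publicly available. The snippet below shows one plausible way to fetch them for a reproduction attempt; the OpenML identifier and the ProPublica/UCI URLs are common public sources assumed here rather than taken from the paper, so verify them before use.

```python
# Illustrative dataset loading; sources are assumptions, not from the paper.
import pandas as pd
from sklearn.datasets import fetch_openml

# UCI Adult Income (mirrored on OpenML).
adult = fetch_openml("adult", version=2, as_frame=True).frame

# COMPAS recidivism data released by ProPublica (Angwin et al., 2016).
compas = pd.read_csv(
    "https://raw.githubusercontent.com/propublica/compas-analysis/master/"
    "compas-scores-two-years.csv"
)

# Default of Credit Card Clients (Yeh & Lien, 2009), hosted by the UCI repository.
default_credit = pd.read_excel(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
    "default%20of%20credit%20card%20clients.xls",
    header=1,
)
```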
Dataset Splits: No. The paper mentions 'training set' and 'test set' but does not explicitly specify a separate 'validation set' or a split for validation data.

Hardware Specification: Yes. "All experiments are run on a computing cluster over a set of homogeneous nodes using Intel Platinum 8260 Cascade Lake @ 2.4GHz CPU."
Software Dependencies: Yes. "The proposed CP models described in Section 6 are solved using the OR-Tools CP-SAT solver (Perron & Didier) (v9)."
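The reconstruction is formulated as a constraint programming model and solved with OR-Tools CP-SAT. As a rough illustration of what such a model looks like, here is a minimal self-contained toy; the tree structure, leaf counts, and problem size are invented for the example and do not reproduce the paper's Section 6 formulation.

```python
# Toy CP-SAT model in the spirit of the reconstruction idea (illustrative only).
from ortools.sat.python import cp_model

N_EXAMPLES = 10   # hypothetical training-set size
N_FEATURES = 2    # hypothetical binary features

model = cp_model.CpModel()

# Unknown training set: one Boolean variable per (example, feature) cell.
x = [[model.NewBoolVar(f"x_{i}_{j}") for j in range(N_FEATURES)]
     for i in range(N_EXAMPLES)]

# "Tree 1": a stump on feature 0 whose leaf counts reveal that 6 examples had f0 = 1.
model.Add(sum(x[i][0] for i in range(N_EXAMPLES)) == 6)

# "Tree 2": a stump on feature 1 whose leaf counts reveal that 4 examples had f1 = 1.
model.Add(sum(x[i][1] for i in range(N_EXAMPLES)) == 4)

# "Tree 3": splits on f0 then f1; its (f0 = 1, f1 = 1) leaf holds 3 examples.
both = []
for i in range(N_EXAMPLES):
    b = model.NewBoolVar(f"both_{i}")
    model.AddBoolAnd([x[i][0], x[i][1]]).OnlyEnforceIf(b)
    model.AddBoolOr([x[i][0].Not(), x[i][1].Not()]).OnlyEnforceIf(b.Not())
    both.append(b)
model.Add(sum(both) == 3)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    reconstruction = [[solver.Value(x[i][j]) for j in range(N_FEATURES)]
                      for i in range(N_EXAMPLES)]
    print(reconstruction)
```

Stacking the per-tree counting constraints is what progressively narrows the set of training sets consistent with the forest, which is the intuition behind why larger ensembles leak more.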
Experiment Setup: Yes. "More precisely, we use a number of trees |T| ∈ {1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} with maximum depth dmax ∈ {None, 2, 3, 4, 5, 10} (where None stands for no maximum depth constraint). For each experiment, we randomly sample 100 examples from the entire dataset to form a training set, and use the remaining ones as a test set to verify to what extent the models generalize. We repeat the experiment five times using different seeds for the random sampling, and report the average results and their standard deviation across the five runs."
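A minimal sketch of this grid, assuming scikit-learn is used to train the target forests (the paper does not tie the setup to a particular library, and the synthetic data below merely stands in for COMPAS, Adult, and Default of Credit Card Clients):

```python
# Sketch of the reported experimental grid under the assumptions stated above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 15))   # placeholder binary dataset
y = rng.integers(0, 2, size=1000)

n_trees_grid = [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
depth_grid = [None, 2, 3, 4, 5, 10]        # None = no maximum depth constraint

for seed in range(5):                      # five repetitions with different seeds
    # Randomly sample 100 examples for training; the rest form the test set.
    idx = np.random.RandomState(seed).permutation(len(X))
    train_idx, test_idx = idx[:100], idx[100:]
    for n_trees in n_trees_grid:
        for depth in depth_grid:
            clf = RandomForestClassifier(
                n_estimators=n_trees,
                max_depth=depth,
                bootstrap=False,           # no bootstrap aggregation, as in the attacked setting
                max_features="sqrt",       # feature randomization at each split
                random_state=seed,
            )
            clf.fit(X[train_idx], y[train_idx])
            acc = clf.score(X[test_idx], y[test_idx])
            print(seed, n_trees, depth, round(acc, 3))
            # ...the reconstruction attack and its error metrics would be run here...
```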