Finding Influential Training Samples for Gradient Boosted Decision Trees

Authors: Boris Sharchilev, Yury Ustinovskiy, Pavel Serdyukov, Maarten de Rijke

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency. |
| Researcher Affiliation | Collaboration | 1 Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; 2 Yandex, Moscow, Russia; 3 Department of Mathematics, Princeton University, Princeton, NJ, USA. |
| Pseudocode | Yes | Algorithm 1 (LeafRefit) |
| Open Source Code | Yes | Supporting code for the paper is available at https://github.com/bsharchilev/influence_boosting. |
| Open Datasets | Yes | The datasets used for evaluation are: (1) the Adult data set (Adult; dat, 1996), (2) the Amazon Employee Access Challenge dataset (Amazon; dat, 2013), (3) the KDD Cup 2009 Upselling dataset (Upselling; dat, 2009) and, for the domain-mismatch experiment, (4) the Hospital Readmission dataset (Strack et al., 2014). |
| Dataset Splits | No | The paper mentions splitting training points for specific analyses and constructing training sets for the domain-mismatch experiments, but it does not give explicit train/validation/test splits (percentages, counts, or standard splits) for general model training or hyperparameter tuning. |
| Hardware Specification | No | The paper does not specify the hardware (CPU/GPU models, memory, or other machine details) used to run its experiments. |
| Software Dependencies | No | For our experiments with GBDT, we use CatBoost (cat, 2018), an open-source implementation of GBDT by Yandex. (CatBoost is named, but no version number or other dependency versions are given.) |
| Experiment Setup | No | Dataset statistics and corresponding CatBoost parameters can be found in the supplementary material. No specific hyperparameters or training configurations are provided directly in the main text. |
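The Pseudocode row refers to the paper's Algorithm 1 (LeafRefit), which estimates a training point's influence by keeping the learned tree structures fixed and refitting only the leaf values with that point removed. The sketch below is a minimal, simplified illustration of that idea for a single tree with mean-residual leaf values and squared-error loss; it ignores how removing a point changes the gradients of later boosting iterations, which the paper's full algorithm accounts for, and all function and variable names are hypothetical rather than taken from the supporting repository.

```python
import numpy as np

def refit_leaf_values(leaf_ids, residuals, exclude=None):
    """Recompute each leaf's value as the mean residual of the training
    points routed to it, optionally excluding one training index."""
    values = {}
    for leaf in np.unique(leaf_ids):
        idx = np.where(leaf_ids == leaf)[0]
        if exclude is not None:
            idx = idx[idx != exclude]
        values[leaf] = residuals[idx].mean() if len(idx) else 0.0
    return values

def leaf_refit_influence(leaf_ids_train, residuals, test_leaf, test_target, remove_idx):
    """Change in squared-error loss on one test point when training point
    `remove_idx` is removed and the leaf values are refit (structure fixed)."""
    full = refit_leaf_values(leaf_ids_train, residuals)
    loo = refit_leaf_values(leaf_ids_train, residuals, exclude=remove_idx)
    loss_full = (test_target - full[test_leaf]) ** 2
    loss_loo = (test_target - loo[test_leaf]) ** 2
    return loss_loo - loss_full

# Toy usage: six training points routed to two leaves of one tree.
leaf_ids_train = np.array([0, 0, 0, 1, 1, 1])
residuals = np.array([1.0, 1.2, 0.8, -1.0, -0.9, -1.1])
delta = leaf_refit_influence(leaf_ids_train, residuals,
                             test_leaf=0, test_target=1.0, remove_idx=2)
print(delta)  # positive: removing point 2 increases the test loss, so it was helpful
```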
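The Open Datasets and Dataset Splits rows note that the evaluation data are public but that explicit splits are not reported. As a convenience only, the UCI Adult data set can typically be fetched through OpenML as sketched below; the OpenML name/version and the 80/20 split are assumptions for illustration, not the loading procedure or splits used in the paper.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Assumption: UCI Adult is available on OpenML under the name "adult" (version 2);
# the paper itself cites the UCI repository directly (dat, 1996).
adult = fetch_openml("adult", version=2, as_frame=True)
X, y = adult.data, adult.target

# The paper does not state its splits, so this 80/20 split is purely illustrative.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```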
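The Software Dependencies row notes that the experiments use CatBoost without pinning a version. For reference, a minimal CatBoost training call looks like the following; the hyperparameter values here are placeholders, not the settings from the paper's supplementary material, and the synthetic data simply keeps the example self-contained.

```python
import numpy as np
from catboost import CatBoostClassifier

# Tiny synthetic binary-classification data; the paper's actual data are the
# Adult, Amazon, Upselling, and Hospital Readmission datasets listed above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Placeholder hyperparameters; the settings actually used in the paper are
# given in its supplementary material and are not reproduced here.
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    loss_function="Logloss",
    verbose=False,
)
model.fit(X, y)
print(model.predict_proba(X[:3]))  # class probabilities for the first three rows
```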