MonoForest framework for tree ensemble analysis
Authors: Igor Kuralenok, Vasilii Ershov, Igor Labutin
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, it shows results comparable with state-of-the-art interpretation techniques. Another application of the framework is ensemble-wise pruning: we can drop monomials from the polynomial based on train data statistics, reducing the model size up to 3 times without loss of quality. For this experiment we used a model built by LightGBM for the Higgs dataset, transferred it to an ensemble of symmetric trees by Algorithm 1, and then compared execution times of the original model and the transformed version. Obtained ROC AUC values are presented in the second (original model) and third (pruned model) columns of Table 1. The experimental results allow us to claim that the tree ensemble can be significantly reduced without loss of model quality in a variety of practical tasks. To demonstrate the quality of the proposed approach, we built a decision-tree-ensemble binary one-vs-rest classifier for each class of the MNIST dataset using CatBoost and analyzed each classifier using three methods: MonoForest, SHAP, and permutation-based Model Reliance, proposed by Fisher et al. (8). (Illustrative sketches of the polynomial view and of the interpretation comparison follow the table.) |
| Researcher Affiliation | Collaboration | Igor Kuralenok (Yandex / JetBrains Research) solar@yandex-team.ru; Vasily Ershov (Yandex) noxoomo@yandex-team.ru; Igor Labutin (Yandex / SPb HSE) Labutin.IgorL@gmail.com |
| Pseudocode | Yes | Algorithm 1: Greedy ensemble composition algorithm. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository for the described methodology is provided in the paper. |
| Open Datasets | No | The paper mentions 'Higgs dataset' and 'MNIST dataset' and 'publicly-available binary classification datasets' but does not provide a specific link, DOI, repository name, or a formal citation with author and year to access these datasets. |
| Dataset Splits | No | The paper mentions 'we split the data into train/validate/test groups' and 'parameters were tuned on train/validation pair' but does not provide specific percentages, sample counts, or a citation to predefined splits for the validation set. |
| Hardware Specification | Yes | For the experiment we used a dual-socket server with Intel Xeon CPU E5-2650 and 256 GB of RAM. |
| Software Dependencies | Yes | CatBoost version was equal to 0.14.2 |
| Experiment Setup | No | The paper states, 'It is important to note that we tuned the optimal gradient step, number of trees in the ensemble, regularization parameters of single trees, etc.' but does not provide the specific values for these hyperparameters or detailed system-level training settings in the main text. |
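To make the framework quoted above more concrete, the following is a minimal sketch of the polynomial view MonoForest is built on: a decision tree is rewritten as a sum of indicator monomials of the form w · Π I(x_j > c), the representation on which Algorithm 1 and the ensemble-wise pruning operate. The dict-based tree format and helper names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the MonoForest idea: rewrite a single decision tree as a
# polynomial over indicator monomials I(x_j > c).  The dict-based tree format
# and helper names below are illustrative assumptions, not the authors' code.
from itertools import product

# Toy tree: internal nodes test x[feature] > threshold, leaves carry a value.
tree = {
    "feature": 0, "threshold": 0.5,
    "left":  {"value": 1.0},                         # x0 <= 0.5
    "right": {"feature": 1, "threshold": -1.0,
              "left":  {"value": -2.0},              # x0 > 0.5, x1 <= -1.0
              "right": {"value": 3.0}},              # x0 > 0.5, x1 > -1.0
}

def tree_to_leaf_terms(node, path=()):
    """Yield (weight, conditions) pairs, where conditions is the tuple of
    (feature, threshold, is_greater) factors along the root-to-leaf path."""
    if "value" in node:
        yield node["value"], path
        return
    f, t = node["feature"], node["threshold"]
    yield from tree_to_leaf_terms(node["left"],  path + ((f, t, False),))
    yield from tree_to_leaf_terms(node["right"], path + ((f, t, True),))

def expand_to_polynomial(leaf_terms):
    """Expand each leaf term: I(x <= t) = 1 - I(x > t), so a path with k
    'less-or-equal' factors becomes 2^k signed monomials that use only
    I(x > t) factors."""
    poly = {}
    for weight, conds in leaf_terms:
        greater = tuple(c for c in conds if c[2])
        lesser = [c for c in conds if not c[2]]
        # For every 'less-or-equal' factor choose either the constant 1
        # (pick = 0) or the term -I(x > t) (pick = 1) and expand the product.
        for picks in product((0, 1), repeat=len(lesser)):
            sign = (-1) ** sum(picks)
            factors = greater + tuple(
                (f, t, True) for (f, t, _), p in zip(lesser, picks) if p)
            key = tuple(sorted(set(factors)))
            poly[key] = poly.get(key, 0.0) + sign * weight
    return poly

polynomial = expand_to_polynomial(tree_to_leaf_terms(tree))
for monomial, weight in sorted(polynomial.items(), key=lambda kv: len(kv[0])):
    conds = " * ".join(f"I(x{f} > {t})" for f, t, _ in monomial) or "1"
    print(f"{weight:+.2f} * {conds}")
```

Applying the same expansion to every tree and summing the resulting weights gives one polynomial for the whole ensemble; the pruning described in the paper then amounts to dropping monomials from this dictionary based on train-data statistics.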
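The interpretation experiment quoted in the Research Type row (a binary one-vs-rest CatBoost classifier per MNIST class, each analysed with several importance methods) could be approximated along these lines. scikit-learn's permutation_importance is used here as a stand-in for the paper's MonoForest, SHAP, and Model Reliance comparison, and the dataset loading, iteration count, depth, and test subsample are assumptions rather than the paper's tuned settings.

```python
# Rough reproduction sketch: one-vs-rest CatBoost classifiers on MNIST with
# permutation importance as a substitute for the paper's importance methods.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.datasets import fetch_openml
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for digit in range(10):
    # Binary one-vs-rest target for the current class.
    y_bin_train = (y_train == str(digit)).astype(int)
    y_bin_test = (y_test == str(digit)).astype(int)

    model = CatBoostClassifier(iterations=200, depth=6, verbose=False)
    model.fit(X_train, y_bin_train)

    # Permutation importance over a small test subsample to keep the run
    # short; the paper instead compares MonoForest, SHAP, and Model Reliance.
    result = permutation_importance(
        model, X_test[:2000], y_bin_test[:2000], n_repeats=3, random_state=0)
    top_pixels = np.argsort(result.importances_mean)[::-1][:5]
    print(f"class {digit}: most important pixels {top_pixels}")
```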