Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MonoForest framework for tree ensemble analysis
Authors: Igor Kuralenok, Vasilii Ershov, Igor Labutin
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, it shows comparable results with state-of-the-art interpretation techniques. Another application of the framework is ensemble-wise pruning: we can drop monomials from the polynomial, based on train data statistics. This way we reduce the model size up to 3 times without loss of its quality. For our experiment we used a model built by LightGBM for the Higgs dataset, transferred it to an ensemble of symmetric trees by Algorithm 1, and then compared the execution times of the original model and the transformed version. Obtained ROC AUC values are presented in the second (original model) and third (pruned model) columns of Table 1. The experimental results allow us to claim that the tree ensemble can be significantly reduced without loss of model quality in a variety of practical tasks. To demonstrate the quality of the proposed approach we built a decision tree ensemble binary one-vs-rest classifier for each class of the MNIST dataset using CatBoost and analyzed each classifier using three methods: MonoForest, SHAP, and permutation-based Model Reliance, proposed by Fisher et al. (8). |
| Researcher Affiliation | Collaboration | Igor Kuralenok Yandex / JetBrains Research EMAIL Vasily Ershov Yandex EMAIL Igor Labutin Yandex / SPb HSE EMAIL |
| Pseudocode | Yes | Algorithm 1: Greedy ensemble composition algorithm. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository for the described methodology is provided in the paper. |
| Open Datasets | No | The paper mentions 'Higgs dataset' and 'MNIST dataset' and 'publicly-available binary classification datasets' but does not provide a specific link, DOI, repository name, or a formal citation with author and year to access these datasets. |
| Dataset Splits | No | The paper mentions 'we split the data into train/validate/test groups' and 'parameters were tuned on train/validation pair' but does not provide specific percentages, sample counts, or a citation to predefined splits for the validation set. |
| Hardware Specification | Yes | For experiment we have used dual-socket server with Intel Xeon CPU E5-2650 and 256GB of RAM. |
| Software Dependencies | Yes | CatBoost version was equal to 0.14.2 |
| Experiment Setup | No | The paper states, 'It is important to note that we tuned the optimal gradient step, number of trees in the ensemble, regularization parameters of single trees, etc.' but does not provide the specific values for these hyperparameters or detailed system-level training settings in the main text. |
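The ensemble-wise pruning quoted in the Research Type row can be sketched for the simplest case: depth-1 trees (stumps). Each stump `f(x) = left + (right - left) * [x[feature] > threshold]` contributes a bias term plus one weighted indicator monomial, so the whole ensemble collapses into a single polynomial whose small-weight monomials can be dropped. This is a minimal illustration, not the paper's implementation: the helper names and the stump-only restriction are assumptions, and MonoForest itself handles arbitrary-depth (oblivious) trees and data-driven pruning statistics.

```python
# Minimal sketch: stump ensemble -> polynomial of indicator monomials -> pruning.
# Assumed representation: each stump is (feature, threshold, left_value, right_value).
from collections import defaultdict


def stumps_to_polynomial(stumps):
    """Collapse a stump ensemble into a bias plus weighted indicator monomials.

    Each stump evaluates to left + (right - left) * [x[feature] > threshold],
    so monomials sharing the same (feature, threshold) condition merge by
    summing their weights.
    """
    bias = 0.0
    monomials = defaultdict(float)  # (feature, threshold) -> weight
    for feature, threshold, left, right in stumps:
        bias += left
        monomials[(feature, threshold)] += right - left
    return bias, dict(monomials)


def prune(bias, monomials, eps):
    """Ensemble-wise pruning: drop monomials whose absolute weight is below eps.

    A stand-in criterion; the paper prunes based on train-data statistics.
    """
    return bias, {k: w for k, w in monomials.items() if abs(w) >= eps}


def predict(bias, monomials, x):
    """Evaluate the polynomial: bias plus weights of satisfied conditions."""
    return bias + sum(w for (f, t), w in monomials.items() if x[f] > t)
```

For example, two stumps splitting on the same condition merge into one monomial, so the polynomial form can already be smaller than the original ensemble before any pruning.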