Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Coresets for Decision Trees of Signals
Authors: Ibrahim Jubran, Ernesto Evgeniy Sanches Shayda, Ilan I Newman, Dan Feldman
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on sklearn and LightGBM show that applying our coresets on real-world datasets boosts the computation time of random forests and their parameter tuning by up to x10, while keeping similar accuracy. |
| Researcher Affiliation | Academia | Ibrahim Jubran Department of Computer Science University of Haifa, Israel EMAIL Ernesto Evgeniy Sanches Shayda Department of Computer Science University of Haifa, Israel EMAIL Ilan Newman Department of Computer Science University of Haifa, Israel EMAIL Dan Feldman Department of Computer Science University of Haifa, Israel EMAIL |
| Pseudocode | Yes | Algorithm 1: SLICEPARTITION(D, σ) ... Algorithm 2: PARTITION(D, γ, σ) ... Algorithm 3: SIGNAL-CORESET(D, k, ε) |
| Open Source Code | Yes | Open source code for our algorithms [35]. ... [35] Jubran, Ibrahim and Sanches, Ernesto and Newman, Ilan and Feldman, Dan. Open source code for the algorithms presented in this paper, 2021. Link for open-source code. |
| Open Datasets | Yes | Datasets. We used the following pair of datasets from the public UCI Machine Learning Repository [3], each of which was normalized to have zero mean and unit variance for every feature: (i): Air Quality Dataset [18] contains n = 9358 instances and m = 15 features. (ii) Gesture Phase Segmentation Dataset [45] contains n = 9900 instances and m = 18 features. |
| Dataset Splits | No | The paper mentions training and testing data but does not explicitly provide details about a separate validation split or cross-validation methodology used for hyperparameter tuning, beyond implicitly using the test set for evaluation of tuned parameters. |
| Hardware Specification | Yes | The hardware used was a standard MSI Prestige 14 laptop with an Intel Core i7-10710U and 16GB of RAM. |
| Software Dependencies | No | We implemented our coreset construction from Algorithm 3 in Python 3.7, and in this section we evaluate its empirical results, both on synthetic and real-world datasets. ... We used the following common implementations: (i): the function RandomForestRegressor from the sklearn.ensemble package, and (ii): the function LGBMRegressor from the LightGBM package that implements a forest of gradient boosted trees. |
| Experiment Setup | Yes | Both functions were used with their default hyperparameters, unless stated otherwise. ... To tune the hyperparameter k, we randomly generate a set K of possible values for k on a logarithmic scale. |
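The extracted setup mentions two concrete preprocessing and tuning steps: normalizing every feature to zero mean and unit variance, and randomly generating a set K of candidate values for the hyperparameter k on a logarithmic scale. A minimal sketch of both steps is below; the helper names `normalize_features` and `log_scale_candidates`, the range bounds, and the sample count are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def normalize_features(X):
    # Per-feature normalization to zero mean and unit variance,
    # as described for the UCI datasets in the extracted quote.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma

def log_scale_candidates(k_min, k_max, num=8, seed=0):
    # Hypothetical sketch: draw candidate values of k uniformly in
    # log-space, so candidates spread across orders of magnitude.
    rng = np.random.default_rng(seed)
    log_samples = rng.uniform(np.log(k_min), np.log(k_max), size=num)
    return np.unique(np.exp(log_samples).astype(int))
```

Sampling in log-space (rather than linearly) is the standard way to explore a size parameter like k whose useful values span several orders of magnitude; deduplication via `np.unique` also returns the candidate set sorted.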