LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Authors: Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy. In this section, we report the experimental results regarding our proposed LightGBM algorithm. We use five different datasets which are all publicly available. |
| Researcher Affiliation | Collaboration | 1 Microsoft Research, 2 Peking University, 3 Microsoft Redmond |
| Pseudocode | Yes | Algorithm 1: Histogram-based Algorithm; Algorithm 2: Gradient-based One-Side Sampling; Algorithm 3: Greedy Bundling; Algorithm 4: Merge Exclusive Features (illustrative sketches of Algorithms 2 and 3 follow the table) |
| Open Source Code | Yes | The code is available at GitHub: https://github.com/Microsoft/LightGBM. (A minimal usage sketch follows the table.) |
| Open Datasets | Yes | We use five different datasets which are all publicly available. The details of these datasets are listed in Table 1. Among them, the Microsoft Learning to Rank (LETOR) [26] dataset contains 30K web search queries. The features used in this dataset are mostly dense numerical features. The Allstate Insurance Claim [27] and the Flight Delay [28] datasets both contain a lot of one-hot coding features. And the last two datasets are from KDD CUP 2010 and KDD CUP 2012. |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology for train/validation/test) was found. While the paper states that 'All algorithms are run for fixed iterations, and we get the accuracy results from the iteration with the best score,' which implies some form of validation, the validation split itself is not explicitly defined. |
| Hardware Specification | Yes | Our experimental environment is a Linux server with two E5-2670 v3 CPUs (in total 24 cores) and 256GB memories. All experiments run with multi-threading and the number of threads is fixed to 16. |
| Software Dependencies | No | The paper mentions software like XGBoost, scikit-learn, and gbm in R, but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | We set a = 0.05, b = 0.05 for Allstate, KDD10 and KDD12, and set a = 0.1, b = 0.1 for Flight Delay and LETOR. We set γ = 0 in EFB. All algorithms are run for fixed iterations, and we get the accuracy results from the iteration with the best score. |
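
The Pseudocode row lists Algorithm 2 (Gradient-based One-Side Sampling) only by name. As a rough illustration of how the a and b values from the Experiment Setup row enter that procedure, here is a minimal NumPy sketch; the function name `goss_sample` and the array layout are assumptions made for illustration, not the paper's reference implementation.

```python
import numpy as np

def goss_sample(gradients, a=0.1, b=0.1, rng=None):
    """Sketch of Gradient-based One-Side Sampling (GOSS).

    Keep the top a*100% of instances by absolute gradient, randomly sample
    b*100% of the remaining instances, and up-weight the sampled
    small-gradient instances by (1 - a) / b so the estimated information
    gain stays approximately unbiased.
    """
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    top_n = int(a * n)
    rand_n = int(b * n)

    order = np.argsort(-np.abs(gradients))  # indices sorted by |gradient|, largest first
    top_idx = order[:top_n]

    # Uniformly sample from the remaining small-gradient instances.
    rest_idx = order[top_n:]
    sampled_idx = rng.choice(rest_idx, size=rand_n, replace=False)

    used_idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(n)
    weights[sampled_idx] = (1.0 - a) / b  # amplification factor from the paper
    return used_idx, weights

# Toy run with a = b = 0.1, the Flight Delay / LETOR setting from the table above.
idx, w = goss_sample(np.random.default_rng(1).normal(size=1000))
print(len(idx), float(w[idx].sum()))
```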
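
Algorithm 3 (Greedy Bundling) can be sketched in the same spirit. The version below treats two features as conflicting on an instance when both are non-zero and enforces a hard conflict budget; `greedy_bundling` and its arguments are hypothetical names, and the paper's γ = 0 setting corresponds to a budget of zero conflicts.

```python
import numpy as np
from itertools import combinations

def greedy_bundling(feature_matrix, max_conflicts=0):
    """Sketch of Algorithm 3 (Greedy Bundling).

    Two features conflict on an instance when both are non-zero. Features
    are visited in descending order of total conflict count (their degree
    in the conflict graph) and assigned to the first bundle whose
    accumulated conflicts stay within max_conflicts.
    """
    n_features = feature_matrix.shape[1]
    nonzero = feature_matrix != 0

    # Pairwise conflict counts: the weighted edges of the conflict graph.
    conflicts = np.zeros((n_features, n_features), dtype=int)
    for i, j in combinations(range(n_features), 2):
        c = int(np.sum(nonzero[:, i] & nonzero[:, j]))
        conflicts[i, j] = conflicts[j, i] = c

    order = np.argsort(-conflicts.sum(axis=1))  # descending degree
    bundles, bundle_conflicts = [], []
    for f in order:
        placed = False
        for b, members in enumerate(bundles):
            added = sum(conflicts[f, m] for m in members)
            if bundle_conflicts[b] + added <= max_conflicts:
                members.append(int(f))
                bundle_conflicts[b] += added
                placed = True
                break
        if not placed:
            bundles.append([int(f)])
            bundle_conflicts.append(0)
    return bundles

# Toy usage: columns 0 and 1 are mutually exclusive one-hot features, column 2 is dense.
X = np.array([[1, 0, 3],
              [0, 2, 1],
              [1, 0, 2],
              [0, 5, 4]])
print(greedy_bundling(X, max_conflicts=0))  # expected: [[2], [0, 1]]
```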
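
Since the released code is on GitHub, the paper's settings can be approximated through the LightGBM Python package. The parameter names below (`top_rate`, `other_rate`, `enable_bundle`, `max_conflict_rate`) are my best reading of how a, b, and γ map onto the library's options; names and defaults may differ across LightGBM releases, so treat this as a hedged sketch rather than the exact configuration used in the paper.

```python
import numpy as np
import lightgbm as lgb

# Synthetic binary-classification data standing in for one of the paper's datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    "boosting": "goss",        # gradient-based one-side sampling
    "top_rate": 0.1,           # a in the paper (0.1 for Flight Delay / LETOR)
    "other_rate": 0.1,         # b in the paper
    "enable_bundle": True,     # exclusive feature bundling (EFB)
    "max_conflict_rate": 0.0,  # gamma = 0, i.e. no conflicts allowed within a bundle
    "num_leaves": 127,
    "learning_rate": 0.1,
    "verbose": -1,
}

booster = lgb.train(params, train_set, num_boost_round=100)
print(booster.num_trees())
```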