LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Authors: Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy. In this section, we report the experimental results regarding our proposed LightGBM algorithm. We use five different datasets which are all publicly available.
Researcher Affiliation | Collaboration | ¹Microsoft Research, ²Peking University, ³Microsoft Redmond
Pseudocode | Yes | Algorithm 1: Histogram-based Algorithm; Algorithm 2: Gradient-based One-Side Sampling; Algorithm 3: Greedy Bundling; Algorithm 4: Merge Exclusive Features
Open Source Code | Yes | The code is available at GitHub: https://github.com/Microsoft/LightGBM.
Open Datasets | Yes | We use five different datasets which are all publicly available. The details of these datasets are listed in Table 1. Among them, the Microsoft Learning to Rank (LETOR) [26] dataset contains 30K web search queries. The features used in this dataset are mostly dense numerical features. The Allstate Insurance Claim [27] and the Flight Delay [28] datasets both contain a lot of one-hot coding features. And the last two datasets are from KDD CUP 2010 and KDD CUP 2012.
Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology for train/validation/test) was found. While the paper states 'All algorithms are run for fixed iterations, and we get the accuracy results from the iteration with the best score,' implying some form of validation, the validation split itself is not explicitly defined.
Hardware Specification | Yes | Our experimental environment is a Linux server with two E5-2670 v3 CPUs (in total 24 cores) and 256GB memories. All experiments run with multi-threading and the number of threads is fixed to 16.
Software Dependencies | No | The paper mentions software like XGBoost, scikit-learn, and gbm in R, but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | We set a = 0.05, b = 0.05 for Allstate, KDD10 and KDD12, and set a = 0.1, b = 0.1 for Flight Delay and LETOR. We set γ = 0 in EFB. All algorithms are run for fixed iterations, and we get the accuracy results from the iteration with the best score.
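
The pseudocode row above names Algorithm 2 (Gradient-based One-Side Sampling). For readers who want to see the idea in executable form, below is a minimal NumPy sketch based only on the paper's description: keep the top a fraction of instances by absolute gradient, randomly sample a b fraction of the rest, and reweight the sampled small-gradient instances by (1 − a)/b. The function name and signature are illustrative, not taken from the paper or the LightGBM codebase.

```python
import numpy as np

def goss_sample(gradients, a=0.1, b=0.1, rng=None):
    """Hedged sketch of Gradient-based One-Side Sampling (GOSS, Algorithm 2).

    gradients : 1-D array of per-instance gradient values
    a         : fraction of instances with the largest |gradient| to keep
    b         : fraction of the remaining instances to sample at random
    Returns (indices, weights) for the sampled subset; the randomly sampled
    small-gradient instances are up-weighted by (1 - a) / b so the gradient
    statistics stay approximately unbiased, as described in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    top_n = int(a * n)
    rand_n = int(b * n)

    order = np.argsort(-np.abs(gradients))   # sort by |gradient|, descending
    top_idx = order[:top_n]                   # always keep large-gradient instances
    rest_idx = order[top_n:]
    sampled_rest = rng.choice(rest_idx, size=rand_n, replace=False)

    indices = np.concatenate([top_idx, sampled_rest])
    weights = np.ones(len(indices))
    weights[top_n:] *= (1.0 - a) / b          # amplify sampled small-gradient instances
    return indices, weights
```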
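The open-source row points to the public GitHub repository. A minimal usage sketch with the Python package is shown below; it assumes `lightgbm` installed from PyPI (e.g. `pip install lightgbm`) and uses its standard `Dataset`/`train` API. The toy data and parameter values are placeholders, not the paper's experimental configuration.

```python
import lightgbm as lgb
import numpy as np

# Toy binary-classification data standing in for one of the paper's datasets
# (placeholder only; not LETOR, Allstate, Flight Delay, or the KDD CUP data).
X = np.random.rand(1000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    "num_leaves": 31,       # illustrative values, not the paper's settings
    "learning_rate": 0.1,
}

booster = lgb.train(params, train_set, num_boost_round=100)
preds = booster.predict(X)  # predicted probabilities for the positive class
```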
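The experiment-setup row's a/b sampling ratios and the EFB conflict ratio γ appear to correspond to LightGBM's `top_rate`, `other_rate`, and `max_conflict_rate` parameters (with GOSS selected as the boosting type), and the 16 threads from the hardware row to `num_threads`. That mapping is an interpretation of the library's documentation rather than something the paper states, so the configuration sketch below should be read as an assumption.

```python
import lightgbm as lgb

# Assumed mapping of the paper's reported settings onto LightGBM parameters
# (a -> top_rate, b -> other_rate, EFB gamma -> max_conflict_rate); this is an
# interpretation, not a configuration published by the authors.
params_letor = {
    "objective": "lambdarank",   # LETOR is a ranking dataset; the objective choice is assumed
    "boosting": "goss",          # enable Gradient-based One-Side Sampling
    "top_rate": 0.1,             # paper: a = 0.1 for Flight Delay and LETOR
    "other_rate": 0.1,           # paper: b = 0.1 for Flight Delay and LETOR
    "max_conflict_rate": 0.0,    # paper: gamma = 0 in EFB
    "num_threads": 16,           # paper: number of threads fixed to 16
}

# train_set would be an lgb.Dataset built from the LETOR features and labels
# (with query group sizes supplied for the ranking objective), e.g.:
# booster = lgb.train(params_letor, train_set, num_boost_round=500)
```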