LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Authors: Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy. In this section, we report the experimental results regarding our proposed LightGBM algorithm. We use five different datasets which are all publicly available.
Researcher Affiliation | Collaboration | ¹Microsoft Research, ²Peking University, ³Microsoft Redmond
Pseudocode | Yes | Algorithm 1: Histogram-based Algorithm; Algorithm 2: Gradient-based One-Side Sampling; Algorithm 3: Greedy Bundling; Algorithm 4: Merge Exclusive Features
Open Source Code | Yes | The code is available at GitHub: https://github.com/Microsoft/LightGBM.
Open Datasets | Yes | We use five different datasets which are all publicly available. The details of these datasets are listed in Table 1. Among them, the Microsoft Learning to Rank (LETOR) [26] dataset contains 30K web search queries. The features used in this dataset are mostly dense numerical features. The Allstate Insurance Claim [27] and the Flight Delay [28] datasets both contain a lot of one-hot coding features. And the last two datasets are from KDD CUP 2010 and KDD CUP 2012.
Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology for train/validation/test) was found. While the paper states 'All algorithms are run for fixed iterations, and we get the accuracy results from the iteration with the best score,' implying some form of validation, the validation split itself is not explicitly defined.
Hardware Specification | Yes | Our experimental environment is a Linux server with two E5-2670 v3 CPUs (in total 24 cores) and 256GB memories. All experiments run with multi-threading and the number of threads is fixed to 16.
Software Dependencies | No | The paper mentions software like XGBoost, scikit-learn, and gbm in R, but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | We set a = 0.05, b = 0.05 for Allstate, KDD10 and KDD12, and set a = 0.1, b = 0.1 for Flight Delay and LETOR. We set γ = 0 in EFB. All algorithms are run for fixed iterations, and we get the accuracy results from the iteration with the best score.
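
The pseudocode row above names Algorithm 2 (Gradient-based One-Side Sampling). For readers who want to see the idea in executable form, below is a minimal NumPy sketch based only on the paper's description: keep the top a fraction of instances by absolute gradient, randomly sample a b fraction of the rest, and reweight the sampled small-gradient instances by (1 − a)/b. The function name and signature are illustrative, not taken from the paper or the LightGBM codebase.

```python
import numpy as np

def goss_sample(gradients, a=0.1, b=0.1, rng=None):
    """Hedged sketch of Gradient-based One-Side Sampling (GOSS, Algorithm 2).

    gradients : 1-D array of per-instance gradient values
    a         : fraction of instances with the largest |gradient| to keep
    b         : fraction of the remaining instances to sample at random
    Returns (indices, weights) for the sampled subset; the randomly sampled
    small-gradient instances are up-weighted by (1 - a) / b so the gradient
    statistics stay approximately unbiased, as described in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    top_n = int(a * n)
    rand_n = int(b * n)

    order = np.argsort(-np.abs(gradients))   # sort by |gradient|, descending
    top_idx = order[:top_n]                   # always keep large-gradient instances
    rest_idx = order[top_n:]
    sampled_rest = rng.choice(rest_idx, size=rand_n, replace=False)

    indices = np.concatenate([top_idx, sampled_rest])
    weights = np.ones(len(indices))
    weights[top_n:] *= (1.0 - a) / b          # amplify sampled small-gradient instances
    return indices, weights
```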
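The open-source row points to the public GitHub repository. A minimal usage sketch with the Python package is shown below; it assumes `lightgbm` installed from PyPI (e.g. `pip install lightgbm`) and uses its standard `Dataset`/`train` API. The toy data and parameter values are placeholders, not the paper's experimental configuration.

```python
import lightgbm as lgb
import numpy as np

# Toy binary-classification data standing in for one of the paper's datasets
# (placeholder only; not LETOR, Allstate, Flight Delay, or the KDD CUP data).
X = np.random.rand(1000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    "num_leaves": 31,       # illustrative values, not the paper's settings
    "learning_rate": 0.1,
}

booster = lgb.train(params, train_set, num_boost_round=100)
preds = booster.predict(X)  # predicted probabilities for the positive class
```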
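The experiment-setup row's a/b sampling ratios and the EFB conflict ratio γ appear to correspond to LightGBM's `top_rate`, `other_rate`, and `max_conflict_rate` parameters (with GOSS selected as the boosting type), and the 16 threads from the hardware row to `num_threads`. That mapping is an interpretation of the library's documentation rather than something the paper states, so the configuration sketch below should be read as an assumption.

```python
import lightgbm as lgb

# Assumed mapping of the paper's reported settings onto LightGBM parameters
# (a -> top_rate, b -> other_rate, EFB gamma -> max_conflict_rate); this is an
# interpretation, not a configuration published by the authors.
params_letor = {
    "objective": "lambdarank",   # LETOR is a ranking dataset; the objective choice is assumed
    "boosting": "goss",          # enable Gradient-based One-Side Sampling
    "top_rate": 0.1,             # paper: a = 0.1 for Flight Delay and LETOR
    "other_rate": 0.1,           # paper: b = 0.1 for Flight Delay and LETOR
    "max_conflict_rate": 0.0,    # paper: gamma = 0 in EFB
    "num_threads": 16,           # paper: number of threads fixed to 16
}

# train_set would be an lgb.Dataset built from the LETOR features and labels
# (with query group sizes supplied for the ranking objective), e.g.:
# booster = lgb.train(params_letor, train_set, num_boost_round=500)
```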